Architecture - Lustre Logging API (Obsolete Lustre Wiki; revised 2007-09-30 by Lydia)
<hr />
<div>Lustre Logging API <br />
== Introduction ==<br />
Lustre needs a logging API in numerous places: orphan recovery, RAID1 synchronization, and configuration, all associated with updates of persistent information on multiple systems.<br><br />
Generally, logs are written transactionally and cancelled when a commit on another system completes. <br />
<br />
Log records are stored in log objects. Log objects are currently implemented as files, and possibly, in some minor ways, the APIs below reflect this. In this discussion, we speak of log objects and sometimes of llogs (lustre-logs).<br />
<br />
== API Requirements ==<br />
Some of the key requirements that define the design of these APIs are: <br />
<br />
* The API should be usable through methods <br />
* The methods should not reveal if the API is being used locally or invoked remotely <br />
* Logs only grow <br />
* Logs can be removed; remote callers may not assume that open logs will remain available <br />
* Access to logs should be through stateless APIs that can be invoked remotely <br />
* Access to logs should go through some kind of authorization/authentication system<br />
<br />
== Fundamental data structures ==<br />
=== Logs.===<br />
(1) <br />
Log objects can be identified in two ways <br />
<blockquote><br />
(a) <br />
Through a name. The interpretation of the name is up to the driver servicing the call. Typical examples of named logs are files identified by a path name, text versions of UUIDs, and profile names. <br />
<br><br />
(b) <br />
Through an object identifier, or llog identifier. A directory of llogs that can look up a name to obtain an id provides translation from the naming system to an id-based system. In our implementation, we use a file system directory to provide this catalog function. <br />
</blockquote><br />
<br />
(2) <br />
Logs only contain records <br />
<br />
(3) <br />
Records in the logs have the following structure: <br />
* llog_rec_hdr -a header indicating the index, length, and type. The header is 16 bytes long <br />
* A body, which is an opaque, 32-bit aligned blob <br />
* llog_rec_tail -the length and index of the record, for walking backwards; it is 16 bytes long <br />
(4) <br />
The first record in every log is a 4KB llog_log_rec. The body of this record contains: <br />
* a bitmap of records that have been allocated; bit 0 is set immediately because the header itself occupies it <br />
* The log records themselves follow behind the header <br />
(5) <br />
Records can be accessed by: <br />
* iterating through a specific log <br />
* providing an llog_cookie, which contains the struct llog_logid of the log and the offset in the log file where the record resides. <br />
(6) <br />
Some logs are potentially very large, for example replication logs, and require a hierarchical structure. A catalog of logs is held at the top level. In some cases the catalog structure is two levels deep: <br />
* A catalog API is provided which exploits the lower lustre log API <br />
* Catalog entries are log entries in the catalog log which contain the log id of the log file concerned.<br />
<br />
=== Logging contexts.===<br />
Each obd device has an array of logging contexts (struct llog_ctxt). The contexts contain: <br><br><br />
(1) The generation of the logs. This is a 128-bit integer consisting of the mount count of the originating device and the connection count to the replicators. <br><br />
(2) A handle to an open log (struct llog_handle *loc_handle) <br><br />
(3) <br />
A pointer to the logging commit daemon (struct llog_canceld_ctxt *loc_llcd) <br><br />
(4) <br />
A pointer to the containing obd (struct obd_device *loc_obd)<br> <br />
(5) <br />
An export to the storage obd for the logs (struct obd_export *loc_exp) <br><br />
(6) <br />
A method table (struct llog_operations *loc_logops) <br />
<blockquote><br />
'''lop_destroy:''' destroy a log <br><br />
'''lop_create:''' create/open a log <br><br />
'''lop_next_block:''' read next block in a log <br><br />
'''lop_close:''' close a log <br><br />
'''lop_read_header:''' read the header in a log <br><br />
'''lop_setup:''' set up a logging subsystem <br><br />
'''lop_add:''' add a record to a log <br><br />
'''lop_cancel:''' cancel a log record <br><br />
'''lop_connect:''' start a logging connection. This is called by the originator to initiate cancellation handling and log recovery processing on the replicator's side. The originator calls this from a few places in the recovery state machine.<br><br />
'''lop_write_rec:''' write a log record <br />
</blockquote><br />
<br />
== Llog connections and the Cancellation API ==<br />
This section describes the typical use of the logging API to manage distributed commit of related persistent updates. The next section describes recovery in case of network or system failures. We consider systems that make related updates and use the following definitions: <br />
<blockquote><br />
'''Originator:''' -the first system performing a transaction <br><br />
'''Replicators:''' -one or more other systems performing a related persistent update <br />
</blockquote><br />
The key requirement is that the replicators must complete their updates if the originators do, even if the originating systems crash or the replicators roll back. Note that we do not require that the system remains invariant under rollback of the originator. <br />
<br />
This goal is achieved by transactionally recording the originator's action in a log. When the replicator's related action commits, it cancels the log entry on the originator. In the subsequent sections, we describe the handshake and protocols involved. <br />
<br />
=== Llog connections.===<br />
In order to process cancellation and recovery actions, the originators and replicators use a ptlrpc connection to execute remote procedure calls. The connection can be set up on the originator or the replicator; we call the system setting up the connection the initiator and the target of that connection the receptor. <br />
The connection is used symmetrically, that is, the originator and replicator can each be either the initiator or the receptor. The obd device structure has an optional llog_obd_ctxt which holds a pointer to the import to be used for queuing RPCs. <br />
<br />
* The originator and the replicator establish a connection. These are the usual connections used by other subsystems. <br />
* The logging subsystem on the originator invokes the lop_connect method on the replicator. The lop_connect call sends the logids of the open catalogs from the originator to the replicator. <br />
* Just prior to sending this, the originator context increases its generation and includes the generation and the logid in the lop_connect method, usually calling llog_orig_connect. <br />
* The replicator now receives a llog_connect RPC. The handler is the replicator's lop_connect (usually llog_repl_connect). This method first increases the llcd's generation and then initiates processing of the logs.<br />
<br />
=== The cancellation daemon.===<br />
<br />
A replicator runs a subsystem responsible for collecting pages of cookies and sending them to the originator for cancellation of the origin log records. This is done as a side effect of committing the replicating transaction on the replicator. <br />
<br />
A key element in the cancellation is to distinguish between old and new cookies. Old cookies are those that have a generation smaller than the current generation, new cookies have the current generation. The generation is present in the llog_context, hence it is both on the server and on the client. <br />
The cancellation context is responsible for the queueing of cancel cookies. For each originator it is in one of two states: <br />
<blockquote><br />
(1) <br />
Accepting cookies for cancellation <br />
<br><br />
(2) <br />
Dropping cookies for cancellation <br />
</blockquote><br />
<br />
The context switches from 1 to 2 if a timeout occurs on the cancellation rpc. It switches from 2 to 1 in two cases: <br />
<blockquote><br />
(1) <br />
A cookie is presented with an llog_generation bigger than the one held in the context <br />
<br><br />
(2) <br />
The replicator receives a llog_connect method (which will also carry a new llog_generation) <br />
</blockquote><br />
<br />
The llog_generation is an increasing sequence of 128-bit integers, with the highest-order bits the boot count of the originator and the lower bits the obd_conncnt between the originator and the replicator. The originator increases its generation just before sending the llog_connect call; the replicator increases it just prior to beginning the handling of recovery when receiving an llog_connect call.<br />
<br />
=== Normal operation.===<br />
Under normal operation, the originator performs a transaction and as a part of the transaction, writes a log record for each replicator. The following steps are then followed to ensure that the replicator is updated with a copy: <br />
<br />
* The log record creation, done with lop_add produces a log_cookie <br />
* The log_cookie is sent to the replicator, through a means that we do not discuss here. <br />
* The replicator performs the related transaction and executes a commit callback for that. The callback indicates that the log_cookie can be put up for cancellation. The function lop_cancel is responsible for this queuing of the cancellation. <br />
* When the replicator has a page full of cancellation cookies, it sends the cookies to the originator <br />
* The originator cancels the log records associated with the cookies and cleans up the empty log files. The handling function is llog_handle_cancel and it invokes the originator's lop_cancel functions to remove the log records. <br />
<br />
<br />
The replication scenarios are closely related to commit callbacks and RPCs; the key differences are: <br />
<br />
* The commit callbacks with transaction numbers involve a volatile client and a persistent server <br />
* The transaction sequence is determined by the server in the volatile-persistent case, and by the originator in the replicating case<br />
<br />
=== Examples. ===<br />
==== Deletion of files.====<br />
Change needs to be replicated from MDS (originator) to OST’s (replicators): <br />
<blockquote><br />
* The OSC’s used by the LOV on the MDS act as originator for the change log, using the storage and disk transactions offered by the MDS: <br />
<br />
– <br />
OSC’s write log records for file unlink events. This is done through an obd API which stacks the MDS on the LOV on the OSC’s. Such events are caused by unlink calls, by closing open but unlinked files, by removing orphans (which is recovery from failed closes), and by renames that clobber existing inodes. <br />
<br><br />
– <br />
The OSC’s create cookies to be returned to OSTs. These cookies are piggybacked on the replies of unlink, close, and rename calls. In the case of removing orphans, the cookies are passed to obd_destroy calls executed on the MDS. <br />
<br><br />
<br />
* OST’s act as replicators; they must delete the objects associated with the inode. <br />
<br />
– <br />
Remove objects <br />
<br><br />
– <br />
Pass OSC generated cookies as parameters to obd_destroy transactions <br />
<br><br />
– <br />
Collect cookies in pages for bulk cancellation RPCs to the OSC on MDS <br />
<br><br />
– <br />
Cancel records on the OSCs on MDS<br />
<br />
==== File size changes. ====<br />
* Changes originate on OSTs, these need to be implemented on the MDS <br />
<blockquote><br />
– <br />
Upon the first file size change in an I/O epoch, the OST: <br />
* Writes a new size-change record for the new epoch <br />
* Records the size of the previous epoch in the record <br />
* Records the object id of the previous epoch in the record <br />
* Generates a cancellation cookie <br />
<br />
– <br />
When MDS knows the epoch has ended: <br />
<br />
* It obtains the size at completion of the epoch from the client (or exceptionally from the OST) <br />
* It obtains cancellation cookies for each OST from the client or from the OSTs <br />
* It postpones starting a new epoch until the size is known <br />
* It starts a setattr transaction to store the size <br />
* When it commits, it cancels the records on the OSTs <br />
</blockquote><br />
<br />
==== RAID1 OST. ====<br />
* The primary is the originator, the secondary is the replicator <br />
:– Writes on the primary are accompanied by a change record for an extent<br />
<br />
=== Cancellation timeouts.===<br />
<br />
If the replicator times out during cancellation, it will continue to process the transactions with cookies. The cancellation context will drop the cookies. <br />
<br />
The timeout will indicate to the system that the connection must be recovered.<br />
<br />
== Llog recovery handling ==<br />
When the replicator receives an llog_connect RPC, it increases the llcd’s generation and then spawns a thread to handle the processing of catalogs for the context. For each of the catalogs it is handling, it fetches the catalog’s log_id through an obd_get_cat_info call. When it has received the catalog logid, the replicator calls sync and proceeds with llog_cat_process. <br />
<br />
<blockquote><br />
<br />
* It only processes records in logs from previous log connection generations. <br />
* The catalog processing repeats operations that should have been performed by the initiator earlier <br />
:– The replicator must be able to distinguish: <br />
<blockquote><br />
'''Done:''' If the operation already took place. If so it queues a commit cancellation cookie which will cancel the log record which it found in the catalog’s log that is being processed. Because sync was called there is no question that this cancellation is for a committed replicating action. <br />
<br><br />
'''Not done:''' The operation was not performed, the replicator performs the action, as it usually does, and queues a commit cookie to initiate cancellation of the log record. <br />
</blockquote><br />
* When log processing completes, an obd-method is called to indicate to the system that logs have been fully processed. In the case of size recovery, this means that the MDS can resume caching file sizes and guarantee their correctness.<br />
<br />
</blockquote> <br />
=== Log removal failure.===<br />
<br />
If an initiator crashes during log removal, the log entries may re-appear after recovery. It is important that the removal of a log from a catalog and the removal of the log file are atomic and idempotent. Upon re-connection, the replicator will again process the log. <br />
=== File size recovery. ===<br />
<br />
The recovery of orphan deletion is adequately described by 1.5.1. In the case of file size recovery, things are more complicated.<br />
<br />
== Llog utility and OBD API ==<br />
=== Llog OBD methods. ===<br />
<br />
There is only one obd method related to llog, which is llog_init. <br />
<br />
=== llog_init.===<br />
<br />
This obd method initializes the logging subsystem for an obd. It sets the methods and propagates calls to dependent obd’s. <br />
<br />
=== llog_cat_initialize.===<br />
There is a simple master function '''llog_cat_initialize''' for catalog setup that uses an array of object id’s stored on the storage obd of the logging subsystem. The logids are stored in array form and given to the logging contexts during the '''lop_setup''' calls made by '''llog_init.''' It uses support from lvfs to read and write the catalog entries and to create or remove them.<br />
<br />
== Log method table API ==<br />
Logs can be opened and/or created; this fills in a log handle. The log handle can be used through the log handle API. <br />
=== llog_create.=== <br />
==== Prototype.==== <br />
int llog_create(struct obd_device *obd, struct llog_handle **, struct llog_logid *, char <br />
<br />
==== Parameters.====<br />
==== Return Values.====<br />
==== Description.====<br />
If the logid is not NULL, open an existing log with this id. If the name is not NULL, open or create a log with that name. Otherwise open a nameless log. The object id of the log is stored in the handle upon success of opening or creation. <br />
<br />
<br />
=== llog_close. ===<br />
==== Prototype. ====<br />
: int llog_close(struct llog_handle *loghandle); <br />
<br />
==== Parameters. ====<br />
==== Return Values. ====<br />
==== Description.====<br />
Close the log and free the handle. Remove the handle from the catalog’s list of open handles. If the log has the destroy-if-empty flag set, the log may be destroyed. <br />
<br />
<br />
=== llog_destroy. ===<br />
==== Prototype. ====<br />
: int llog_destroy(struct llog_handle *loghandle); <br />
<br />
==== Parameters.==== <br />
==== Return Values.====<br />
==== Description.====<br />
Destroy the log object and close the handle. <br />
<br />
<br />
=== llog_write_rec.=== <br />
==== Prototype. ====<br />
int llog_write_rec(struct llog_handle *handle, struct llog_rec_hdr *rec, struct llog_cookie <br />
<br />
==== Parameters.====<br />
<br />
<br />
==== Return Values. ====<br />
==== Description.====<br />
Write a record in the log. If ''buf'' is NULL, the record is complete. If ''buf'' is not NULL, it is inserted in the middle. Records are multiples of 128 bits in size and have a header and tail. Write the cookie for the entry into the cookie pointer.<br />
<br />
=== llog_next_block.=== <br />
==== Prototype.====<br />
int llog_next_block(struct llog_handle *h, int curr_idx, int next_idx, __u64 *offset, <br />
<br />
==== Parameters. ====<br />
==== Return Values. ====<br />
==== Description.====<br />
Index curr_idx is in the block at *''offset''. Set *''offset'' to the block offset of record ''next_idx''. Copy ''len'' bytes from the start of that block into the buffer ''buf''.<br />
<br />
=== lop_read_header.=== <br />
==== Prototype.==== <br />
int lop_read_header(struct llog_handle *loghandle); <br />
<br />
==== Parameters.==== <br />
==== Return Values.==== <br />
==== Description.====<br />
Read the header of the log into the handle and also read the last ''rec_tail'' in the log to find the last index that was used in the log.<br />
<br />
=== llog_init_handle.=== <br />
==== Prototype. ====<br />
int llog_init_handle(struct llog_handle *handle, int flags, struct obd_uuid *uuid); <br />
<br />
==== Parameters.==== <br />
==== Return Values.====<br />
==== Description.====<br />
Initialize the handle and try to read the header from the log file. If the log does not have a header yet, build it from the arguments. If the header is read, verify that the flags and UUID in the log equal those of the arguments. <br />
<br />
<br />
=== llog_add_record. ===<br />
<br />
==== Prototype. ====<br />
int llog_add_record(struct llog_handle *cathandle, struct llog_trans_hdr *rec, struct <br />
<br />
==== Parameters.==== <br />
==== Return Values.====<br />
==== Description.====<br />
<br />
=== llog_delete_record. ===<br />
==== Prototype.==== <br />
int llog_delete_record(struct llog_handle *loghandle, struct llog_handle *cathandle); <br />
<br />
==== Parameters.==== <br />
==== Return Values. ====<br />
==== Description.==== <br />
<br />
<br />
=== lop_cancel.=== <br />
==== Prototype. ====<br />
int llog_cancel_record(struct llog_handle *cathandle, int count, struct llog_cookie *cookie); <br />
<br />
==== Parameters.==== <br />
==== Return Values.==== <br />
<br />
==== Description.==== <br />
For each cookie in the cookie array, we clear the log in-use bit and either: <br />
<br />
* Mark it free in the catalog header and delete the log if it is empty <br />
* Just write out the log header if the log is not empty <br />
<br />
The cookies may be in different log files, so we need to look up the log anew each time.<br />
<br />
=== lop_next_block.=== <br />
==== Prototype. ====<br />
int llog_next_block(struct llog_handle *handle, int curr_idx, int next_idx, __u64 *curr_offset, <br />
<br />
==== Parameters.==== <br />
====Return Values.====<br />
==== Description.====<br />
Return the block in the log that contains the record with index next_idx. The curr_idx at offset curr_offset is used to optimize the search. <br />
<br />
<br />
== Sample Method Table Descriptions ==<br />
The obd_llog api <br />
<br />
The obd_llog api has several methods, setup, cleanup, add, cancel, as part of the OBD operations. These operations have 3 implementations: <br />
<blockquote><br />
'''mds_obd_llog_*:''' simply redirects to the mds_osc_obd device, which is normally the LOV running on the MDS, to reach the OST’s. <br />
<br />
'''lov_obd_llog_*:''' calls the method on all relevant OSC devices attached to the LOV. A parameter carrying the striping information of the inode determines which OSC’s should generate a log record for their replicating OST.<br />
</blockquote><br />
A more interesting implementation is the collection of methods used by the OSC on the MDS and by the OBDFILTER: <br />
<br />
<blockquote><br />
'''llog_obd_setup:''' sets up a catalog entry based on a log id.<br />
<br><br />
'''llog_obd_cleanup:''' cleans up all catalog entries in the array<br />
<br><br />
'''llog_obd_origin_add:''' adds a record using the catalog in the llog_obd_ctxt array of handles<br />
<br><br />
'''llog_obd_repl_cancel:''' queues a cookie for cancellation on the replicator.<br />
</blockquote><br />
<br />
=== obd_llog_setup(struct obd_device *obd, struct obd_device *disk_obd, int index, int count, struct llog_logid *idarray).===<br />
<br />
Activating the catalogs for logging and making their headers and file handles available is fairly involved. Each system that requires catalogs manages an array of catalogs. This function is given an array of logid’s and an index. The index pertains to the array of logs used by an originator; the array of logid’s has an entry for each OSC in the LOV stripe descriptor. <br />
=== obd_llog_cleanup(struct obd_device *).===<br />
Cleans up all initialized catalog handles for a device. <br />
<br />
==== int llog_obd_origin_add====<br />
<br />
(struct obd_export *exp, int index, struct llog_rec_hdr *rec, struct lov_stripe_md *lsm, struct llog_cookie *logcookies, int numcookies). Adds a record to the catalog at the given index. The lsm is used to identify how to descend an LOV device. A cookie is generated for each record that is added. <br />
<br />
===int llog_obd_repl_cancel(struct obd_device *obd, struct lov_stripe_md *lsm, int count, struct llog_cookie *cookies, int flags).===<br />
Queue the cookies for cancellation. Flags can be 0 or LLC_CANCEL_NOW for immediate cancellation.<br />
<br />
== Configuration Logs ==<br />
Configuration of Lustre is arranged by using llogs with records that describe the configuration. <br />
The first time a configuration is written it is given a version of 0. Each record is numbered. Configurations can then be updated, which results in: <br />
<blockquote><br />
(1) <br />
a new configuration log <br />
<br><br />
(2) <br />
a change descriptor with the previous configuration <br />
</blockquote><br />
<br />
<blockquote><br />
Configurations are then recorded on the configuration obd. At any time there are stored: <br />
<br><br><br />
<br />
(1) <br />
One full configuration log (for the current version) <br />
<br><br />
(2) <br />
A collection of change descriptors for every change made since the initial configuration. <br />
</blockquote><br />
A client uses the configuration logs in two ways: <br />
<blockquote> <br />
* On startup it fetches the full current configuration log from the configuration obd and processes the records to complete the mount command <br />
* A client can also receive a signal that it needs to refresh its configuration. This signal can be an ioctl, a /proc/sys file, or a lock revocation callback. When the client gets this signal it: <br />
<blockquote><br />
– <br />
Determines its current version of the configuration <br />
<br><br />
– <br />
Asks the config obd for the latest version <br />
<br><br />
– <br />
Fetches the change logs to change the current configuration to the latest one <br />
</blockquote><br />
</blockquote><br />
<br />
The last operation is done with llog_process, using a suitable callback function, as well as the logs that the client has in memory.<br />
<br />
== Size Recovery ==<br />
This section contains a discussion of the recovery of MDS cached sizes from OST’s. <br />
The MDS sees open calls which precede any I/O on a file. When an open request reaches the MDS the file inode is in one of two states: <br />
<br />
<blockquote><br />
'''quiescent:''' No I/O is currently happening on the inode <br><br />
'''I/O epoch:''' The inode is in I/O epoch k. <br />
</blockquote><br />
<br />
If no I/O epoch is active, the MDS starts a new one. The epoch number starts from a random number chosen at boot time and is increased each time a new epoch is started.<br />
<br />
A fairly complicated sequence of events involving the inode may now ensue, such as many other openers. Eventually the clients will all close the file and flush their data. The simplest epoch management scheme is: <br />
<blockquote><br />
<br />
#'''open''' file is opened for write <br />
#'''closed and flushed''' all clients have closed and flushed data <br />
#'''mds''' changes file size and ends epoch <br />
</blockquote><br />
<br />
<br />
When a client closes the file, has no dirty data outstanding, and knows the file size and OST size update cookies authoritatively, it will include them with the close call to the MDS. The MDS will initiate the setattr to update its cached file size and use the cookies. <br />
<br />
When a client closes but doesn’t satisfy some of these conditions it will still make a close call to the MDS. The MDS will know if this is the last client closing the file. If so, it will indicate in its response to the client that it requires the client to obtain the file size and cookies and make an additional setattr call to the MDS with the cookies.<br />
<br />
The client can flush its data and force a flush of other clients data through the DLM. An obd_getattr call will obtain the file size and cookies for a particular epoch. A slightly more lax scheme is to allow the client to update the MDS even when it has not yet flushed all dirty data to the inode. <br />
<br />
The epoch ends when the MDS receives the setattr call. <br />
<br />
The OST should pin the inode in memory and remember the MDS epoch in volatile data. Perhaps it takes a refcount for each client writing to the inode. Each client can indicate to the OST when it<br />
<br />
== References ==<br />
<br />
[[Category:Architecture|Lustre Logging API]]</div>

Architecture - Security (Obsolete Lustre Wiki; revised 2007-09-28 by Lydia)
<hr />
<div>== Introduction ==<br />
<br />
This chapter outlines the security architecture for Lustre. <br />
Satyanarayanan gives an excellent treatment of security <br />
in a distributed filesystem. Our approach seeks to <br />
follow the trail laid out in <br />
his discussion, although the implementation choices and details are quite different. <br />
<br />
=== Usability ===<br />
Only too often have security features led to a serious <br />
burden on administrators. Lustre tries to <br />
avoid this by using existing API’s as much as possible, <br />
particularly in the <br />
area of integration with existing user and group databases. <br />
Lustre only uses standard Unix user <br />
API’s for accessing such data for ordinary users. <br />
Special administrative accounts with unusual <br />
privileges, to perform backups for example, <br />
require some extra configuration.<br />
<br />
Lustre, unlike AFS and DCE/DFS, does not mandate the use of a particular authorization <br />
engine or user and group database, but is happy to work with <br />
what is available. Lustre uses <br />
existing user & group databases and is happy to hook into LDAP, <br />
Active Directory, NIS, or more <br />
specialized databases through the standard NSS database switches. <br />
For example, in an environment <br />
where a small cluster wishes to use the ''/etc/passwd and /etc/group'' <br />
files as the basis of authentication <br />
and authorization, Lustre can easily be configured to use these files. <br />
<br />
We follow in the footsteps of the Samba and NFS v4 projects <br />
in using existing ACL structures, <br />
avoiding the definitions, development, and maintenance of <br />
new access control schemes. <br />
<br />
Lustre implements process authorization groups as they <br />
provide more security from root setuid <br />
attacks, provided hardened kernels are used. <br />
<br />
New features of Lustre are file encryption, careful analysis of <br />
cross realm authentication and authorization issues and file I/O <br />
authorization.<br />
<br />
=== Taxonomy ===<br />
The first question facing us is what the threats are. <br />
These are the security <br />
violations that Lustre tries to avoid: <br />
<br />
# Unauthorized release of information. <br />
# Unauthorized modification of information. <br />
# Denial of resource usage. <br />
<br />
<br />
The latter topic is only very partly addressed. Alternative taxonomies of violations and threats exist <br />
and include concepts such as suspicion, modification, conservation, <br />
confinement, and initialization. <br />
We refer to Satya’s discussion. In the DOD categorization of <br />
security, Lustre fits broadly at <br />
the C2 level, controlled access protection, which includes auditing. <br />
<br />
=== Layering ===<br />
<br />
On the whole Lustre server software is charged with <br />
maintaining persistent <br />
copies of data and should largely be trusted. Clients can take <br />
countermeasures <br />
to avoid too much <br />
trust of servers by optionally sending only encrypted <br />
file data to the servers. While clients are much <br />
less controlled than servers, they carry important <br />
obligations for trust. For example, a compromised <br />
client might steal users passwords and render strong <br />
security useless. <br />
The security subsystem has many layers. Our security model, <br />
like much else in Lustre, leverages <br />
on existing efforts and tries to limit implementation to <br />
genuinely new components. The discussion <br />
in this chapter uses the following division of responsibilities. <br />
<blockquote><br />
'''Trust model:''' When the system activates network interfaces <br />
for the purpose of filesystem <br />
request processing or when it accepts connections from clients, <br />
the interfaces or connections are assigned a GSS-API <br />
security interface. <br />
Examples of these are Kerberos 5, <br />
LIPKEY, and OPEN. </blockquote><br />
<blockquote><br />
'''Authentication:''' When a user of the Lustre filesystem <br />
first identifies herself to the system, <br />
credentials for the user need to be established. <br />
Based on the credentials, <br />
GSS will establish a security context between the client and server. </blockquote><br />
<blockquote><br />
'''Group & user management:''' Files have owners, <br />
group owners, and access control lists <br />
which make reference to identities and membership <br />
relations for the groups <br />
and users of <br />
the system. Slightly different models apply <br />
within the local realm, <br />
where user and group <br />
id’s can be assumed to have global validity, and outside <br />
that domain, where a different user <br />
and group database may be in use on client systems. </blockquote><br />
<blockquote><br />
'''Authorization:''' Before the filesystem grants access to data <br />
for consumption or modification, <br />
it must do an authorization check based on identity <br />
and access control lists. </blockquote><br />
<blockquote><br />
'''Cross realm usage:''' When users from one administrative <br />
domain require access to the <br />
filesystem in a different domain, <br />
a few new problems arise for which <br />
we propose systematic solutions. </blockquote><br />
<blockquote><br />
'''File encryption:''' Lustre uses the encryption and sharing model <br />
proposed by the StorageTek <br />
secure filesystem, SFS [13], but a variety of refinements <br />
and variants <br />
have been proposed <br />
by CFS. </blockquote><br />
<blockquote><br />
'''Auditing:''' For secure environments, auditing file accesses <br />
can be a major deterrent for abuse <br />
and invaluable to find perpetrators. Lustre can audit clients, <br />
meta-data servers, object storage targets, <br />
and access to the key granting services for user <br />
credentials and file encryption <br />
keys.</blockquote><br />
<br />
== Lustre Networks and Security Policy == <br />
<br />
Network trust is of particular importance to Lustre, which must balance <br />
the requirements of a high-performance filesystem with those of a <br />
globally secure filesystem. On the whole, Lustre makes few demands of <br />
trust on the network and can handle both insecure <br />
and secure networks, with different <br />
policies that seriously affect performance. The aim is to identify <br />
those networks where more trust exists, so that <br />
cryptographic activities can be avoided on them. <br />
An initial observation is <br />
that there are two extreme cases that should be covered: <br />
<br />
<blockquote>'''Cluster network:''' Lustre is likely to be used in <br />
compute clusters over networks where: <br />
# Network traffic is private to sender and receiver, i.e., it cannot be observed by third parties.<br />
# Network traffic is unaltered between sender and receiver.<br />
'''Other network:''' On other networks, the trust level is much lower. <br />
No assumptions are made.</blockquote> <br />
<br />
We realize that there are a variety of cases different <br />
from these two extremes that might merit <br />
special treatment. Such special treatment will be left <br />
to mechanisms outside Lustre. Examples of <br />
special treatment might be to use a VPN to connect <br />
a trusted group of client systems to Lustre with <br />
relaxed assumptions. <br />
<br />
Lustre uses Portals. We will '''not''' change the Portals API to <br />
include features to address security. <br />
Instead, we will use Portals network interfaces to assign <br />
GSS security mechanisms to different <br />
streams of incoming events. Lustre will associate a <br />
security policy with a Lustre network. The <br />
policy is one of: <br />
<br />
# No security <br />
# GSS security with integrity <br />
# GSS security with encrypted RPC data <br />
# GSS security with encrypted file data. <br />
<br />
=== Binding GSS Security to Portals Network Interfaces. === <br />
<br />
Incoming and outgoing Portals traffic uses an instance of <br />
a network abstraction layer (NAL). Such an instance is called a <br />
Portals Network Interface. Lustre binds a Portals Network Interface <br />
to traffic from a group of <br />
network endpoints called Lustre Network, <br />
which is identified by a netid. Certain networks, such as <br />
UDP, QSW, or Myrinet, are connectionless, and <br />
intercepting their traffic is less easy. Once <br />
Portals binds to the interface, packets may arrive and <br />
will face the default security policy associated <br />
with the event queue as described above. <br />
In the case of TCP/IP, a client (socket file descriptor based) <br />
connection cannot be made available to <br />
the server subsystems (MDS and OST) until it has been accepted. <br />
The accept is handled by a small <br />
auxiliary program called the acceptor. The basic acceptor functionality, <br />
as shown in figure XXX <br />
[need to add], is to accept the socket, determine from <br />
which Lustre netid the client is connecting, and <br />
hand the accepted socket to the kernel-level Portals NAL. <br />
The Portals NAL then starts listening for <br />
traffic on the socket, and interacts with the portals library <br />
for packet delivery. <br />
<br />
To summarize policy selection: <br />
<blockquote><br />
'''Connectionless networks:''' The security policy is associated with the interface at startup <br />
time, through configuration information passed to the kernel at setup time. <br />
If no configuration information is passed, <br />
a strong security backend is selected. <br />
<br><br />
'''Connection-based networks:''' When the network has connections, <br />
the acceptor of connections decides which Portals NI <br />
will handle the connection. It thereby affects security decisions <br />
and assigns a security policy to a connection. </blockquote><br />
<br />
We invoke the selected security policies before sending traffic or after receiving Portals events for <br />
arriving traffic. As shown in figure 2.1 this is easily done <br />
by using different Portals network <br />
interfaces and event queues. The event queues ultimately <br />
trigger the appropriate GSS backend for <br />
traffic on an NI. Outgoing traffic is handled similarly <br />
at the level of a connectionless network <br />
interface. For TCP, connections for outgoing traffic <br />
are made in the kernel, and the kernel needs <br />
to know what security policy applies to a certain network. <br />
At present this is a configuration <br />
option; if the need arises it could be negotiated with the acceptor on the server. <br />
<br />
[[Image:interactionGSS.jpg]]<br />
<br />
== GSS Authentication & Trust Model == <br />
<br />
The critical questions are what enforces security in Lustre <br />
and what that enforcement consists of. Lustre uses the GSS-API as a model for authentication and integrity of network traffic. <br />
Through the GSS API we can be sure that messages sent to server systems <br />
originate from users <br />
with proven identities, according to a GSS security policy installed at <br />
startup or connection acceptance time. The different levels of <br />
security arise from different GSS security backends. On <br />
trusted networks we need ones which are very efficient to avoid <br />
disturbing the high performance <br />
characteristics of the filesystem, but we also need to be prepared <br />
to run over insecure networks. <br />
<br />
=== The GSS-API === <br />
<br />
The GSS-API provides for 3 important security features. Each of these <br />
mechanisms is used in a particular security context which <br />
the API establishes:<br />
<br />
# The acquisition of credentials with a limited lifetime based on a principal or service name. <br />
# Message integrity. <br />
# Message privacy. <br />
<br />
<br />
In order to do so, the GSS-API is linked to a security mechanism. <br />
At present the GSS-API offers <br />
the Kerberos 5 security mechanism as a backend. Typically <br />
the GSS-API is used as middleware <br />
between the request processing layer and a backend security <br />
mechanism, as illustrated in figure 3.1.<br />
<br />
[[Image:securityprot.jpg]]<br />
<br />
For Lustre to use the GSS-API the following steps have been taken: <br />
<br />
# Locate or build a kernel-level implementation of the GSS-API with support for the TriLabs-required Kerberos 5 security mechanism. This can be obtained from the NFS v4 project. The 0-copy properties of our networking API’s cannot be preserved with that implementation, and changes have to be made to avoid the use of XDR and SUN RPC. <br />
# Modify the Lustre request processing and network I/O API’s to make use of the GSS API to provide their services. This will be original work requiring a fairly detailed design specification for peer scrutiny. The resulting API’s will be similar to those provided by the RPCSEC_GSS-secure RPC API used in NFS v4. They will include various pieces of data returned by the GSS calls in the network packets. <br />
<br />
=== Removing credentials === <br />
<br />
The kdestroy command will remove Kerberos credentials from <br />
the user-level GSS daemon. However, we also need to provide a mechanism to flush the kernel <br />
cache of credentials. If this is not handled by the user-level GSS daemon, a lustre-unlog (à la <br />
kunlog for AFS) should be built. <br />
<br />
=== Special cases ===<br />
<br />
There are a few special connections that need to be maintained. The <br />
most important ones are the family of MDS-OSS connections. The OSS should accept such connections, and the MDS should have a permanently installed mechanism to provide GSS credentials to the authentication mechanisms. The OSS will treat the MDS principal as privileged, just like some <br />
other utilities such as backup, data migration, and HSM software. <br />
<br />
== Process Identity and Authentication ==<br />
<br />
Credentials should be acquired on the basis of a group of processes that can reasonably be expected <br />
to originate from the same authenticated principal. If that process group is determined by the user <br />
id of the process, vulnerabilities can arise when unauthorized users can assume this uid. <br />
<br />
One of the most critical security flaws of NFS is that a root user can setuid to any user and acquire <br />
the identity of this user for NFS authorization. In NFS v4 this is still the case, except that the uid <br />
for which su is performed must have valid credentials. <br />
<br />
The process authentication groups introduced by AFS can partly address this issue; however, they <br />
only provide true protection on clients with hardened kernel software that makes it difficult for the <br />
root user to change kernel memory. SELinux provides such capabilities. Without such hardening, the extra <br />
security offered by PAGs is superficial and should not be provided. <br />
<br />
PAGs may also help when processes running under a single uid on a workstation, but arising from <br />
different network logins, should not be authenticated as a group. In environments where workstations provide strong authentication there may be no need for this, but PAGs can provide effective protection here. <br />
<br />
=== Process Authentication Groups === <br />
<br />
Unix authorizes processes based on their ''uid'': the <br />
''uid'' defines a partition of the set of processes. Many distributed filesystems find this division of the <br />
processes too coarse to give effective protection; such systems introduce smaller Process Authentication Groups (PAG’s). <br />
A group of client processes can be tagged with a PAG. PAG’s are organized to give processes that <br />
truly originate from a single authentication event the same PAG and all other processes a different <br />
PAG. This can separate processes into different PAG’s even if the user ''id'' of the process is the same <br />
and it can bundle processes together that run under different user id’s into the same PAG. <br />
<br />
[[Image:loginpro.jpg]]<br />
<br />
=== Properties of a PAG === <br />
<br />
The smaller group of processes for which authentication should <br />
give access is called a PAG, defined by the following: <br />
<br />
# Every process should belong to a PAG. <br />
# PAG’s are inherited by fork. <br />
# At boot time, init has a zero PAG. <br />
# When a process executes a login-related operation (preferably through a PAM module), this login process would execute a "newpag" system call which places the process in a new PAG. <br />
# Any process can execute ''newpag'' and thereby leave an authentication group of which it was a member.<br />
<br />
=== Implementation === <br />
<br />
Lustre could implement a PAG as a 64 bit number associated with <br />
a process. Login operations will execute a setpag operation. <br><br><br />
A Pluggable Authentication Module (PAM) associated with kinit and login procedures, or the ''llog'' <br />
program, can establish GSSAPI supported credentials with <br />
a user level GSS daemon during or after <br />
login. It is as this point that the PAG for <br />
these credentials should be well defined. <br />
<br />
When the filesystem attempts to execute a filesystem operation for a <br />
PAG for which credentials <br />
are not yet known to the kernel, an upcall could be made to <br />
the GSS daemon to fetch credentials <br />
for the PAG. The Lustre system maintains a cache of <br />
security contexts hashed by PAG. A GSSAPI <br />
authentication handshake will provide credentials <br />
to the meta-data server and establish a security <br />
context for the session; this is illustrated in figure 4.2. <br />
<br />
[[Image:API Authentication.jpg]]<br />
<br />
Once the identity of the PAG has been established, <br />
both the client and the server will have user <br />
identities and group memberships associated with that identity. <br />
How those are handled will be discussed in the next section. <br />
Before authentication has taken place, <br />
a process only gets the credentials <br />
of the anonymous user.<br />
<br />
=== Alternatives ===<br />
<br />
==== AFS implementation ====<br />
<br />
<p> Design Note: The Andrew project used PAG’s for AFS authentication.<br />
They were "hacked" in the sense that they used 2 fields in the groups array. <br />
Root can <br />
fairly easily change fields in the group array on some systems, <br />
but apart from that this implementation avoided changing the kernel.</p><br />
<br />
The Andrew project called the system interface call "''setpag''", <br />
which was executed in terms of ''getgroups'' and ''setgroups''.<br />
<br />
==== PAG and authentication and authorization data ====<br />
<br />
Probably a better way to proceed is <br />
to assign a data structure with security context <br />
and allow all processes in the same PAG to point to <br />
it and take a reference on the data structure. <br />
This authorization data would have room to store a list <br />
of credentials for use on different filesets and security operations. <br />
''newpag'' will be a simple system <br />
call decreasing the refcount on the current PAG of a process and <br />
allocating a new one. We could <br />
use ''/proc/pags'' to hold a list of PAG’s.<br />
<br />
== The User and Group Databases ==<br />
<br />
Lustre uses standard (default) user and group databases <br />
and interfaces to these databases, so that <br />
either enterprise-scale LDAP, NIS, or Active Directory services <br />
can be queried, or local /etc/passwd and /etc/group <br />
databases can be used. <br />
<br />
Users and groups appear fundamentally in two forms to the filesystem: <br />
<br />
# As identities of processes executing filesystem calls.<br />
# As ''user and group owners'' of files, thereby influencing authorization.<br />
<br />
Lustre assumes that within an administrative domain the results of querying for a user or group <br />
name or ''id'' will give consistent results. Lustre also assumes that some special groups and users are created in the authentication databases for use by the filesystem. These address the need to deal <br />
with administrative users and to handle unknown remote users. <br />
<br />
The user and group databases enter into the filesystem-related API’s in just a few places: <br />
<blockquote><br />
'''Client authorization:''' The client filesystem will check group membership and identity of a process against the content of an ACL to enforce protection. <br />
<br><br />
'''Server authorization:''' The server performs another authorization check. The server assumes the identity and group membership of client processes as determined by the security context. It sets the values of file and group owners before creation of new objects. <br />
<br><br />
'''Client filesystem utilities:''' Utilities like ''ls'' require a means to <br />
translate user id’s to names <br />
and query the user databases in the process. <br />
The filesystem has knowledge of the realm <br />
from which the inodes were obtained, but the system call <br />
interface provides no means to <br />
transfer this information to user level utilities. </blockquote><br />
<br />
As we will explain below when covering cross realm situations there is a fundamental mismatch <br />
between the two uses and the UNIX API’s. Lustre’s solution is <br />
presented in the next section. <br />
<br />
=== Lustre Security Descriptor and the Current Protection Sub-Domain ===<br />
<br />
The fundamental question is '''Can agent X perform operation Y on object Z?''' The protection <br />
domain is the collection of agents for which such a question can be asked. In Lustre, the protection <br />
domain consists of: <br />
<br />
# Users and groups. <br />
# Client, MDS, and OST systems. <br />
<br />
For a particular user a current protection subdomain (CPSD) exists, <br />
which is the collection of all <br />
agents the user is a member of. This is shown in figure 5.1. <br />
<br />
UNIX systems introduce a standard protection domain based on what <br />
the UNIX group membership <br />
and user identity are. These are obtained from the ''/etc/passwd and /etc/group'' files, or their network <br />
analogues through the NSS switch model. The UNIX task structure can embed this CPSD information in the task structure of a running process. A user process running with root permissions can <br />
use the ''set(fs)uid, set(fs)gid'', and setgroups system calls, to change the CPSD information. <br />
<br />
Things are more involved for a kernel level server system, to which a user has authenticated over the <br />
network. In that case, the kernel has to reach out to user space to fetch the membership information <br />
and cache it in the kernel to have knowledge of the CPSD. <br />
Such caches may need to be refreshed if <br />
the principal changes its uid and is authorized to do so by the server systems. Lustre servers hold <br />
CPSD attributes associated with a principal in the Lustre Security Descriptor (LSD). <br />
<br />
=== Basic handling of users and groups in Lustre === <br />
<br />
When a client performs an authentication RPC with the server, the server will build a security descriptor for the principal. The security <br />
descriptor is obtained by an upcall. The upcall uses standard Unix API’s to determine: <br />
<br />
# the uid and principal group id associated with the username obtained from the principal's name <br />
# the group membership of this uid <br />
<br />
This information is held and cached, with limited lifetime, in kernel server memory in the LSD <br />
structure associated with the security context for the principal. Other information that will be held <br />
in the LSD applies in non-standard situations: <br />
<br />
# The uid and principal group id of the principal on the client. If the client is not a local client with the same group and user database, this is used as described in the next section. <br />
# Special server-resident attributes of the principal, for example: <br />
<br />
[[Image:CPSDinfoforuser.jpg]]<br />
<blockquote><br />
(a) Is the principal eligible for the server to respect setuid/setgid/setgroups information <br />
supplied by the client (these will only be honoured if the filesystem has an appropriate attribute also)? <br />
<br><br />
(b) Which group/uid values, if set, will be respected? <br />
<br><br />
(c) Is this principal able to access inodes by file identifier only (without a random cookie)? <br />
This is needed for Lustre RAID repair and certain client cache synchronizations. <br />
<br><br />
(d) Should this principal get decryption keys for files even when identities and ACL’s <br />
would not provide them? <br />
<br><br />
(e) Should this principal be able to restore backups (e.g. be allowed to place encrypted files <br />
into the filesystem)?<br />
</blockquote><br />
=== Handling setuid, setgid and setgroups ===<br />
<br />
There are several alternative ways in which <br />
these issues can be handled in the context of network security for a file system. <br />
<br />
==== Privileged principals ==== <br />
<br />
A daemon offering GSS authenticated services can sometimes <br />
perform credential forwarding. Kerberos provides a way to forward credentials. This can provide <br />
excellent NFS v4 Lustre integration. This mechanism is external to Lustre.<br />
<br />
When the service authenticates a user, it can hand its credentials to the user-level GSS daemon, <br />
which can use them to re-authenticate for further services. <br />
Therefore, if Lustre requires a credential for a server process <br />
that has properly forwarded the credentials to the GSS daemon, it can transparently authenticate for this. Note that in this case <br />
the Lustre credential should be associated with <br />
the user id and the PAG (or optionally just with the user id). <br />
<br />
If the server is a user level server, the setuid/setgid/setgroups calls can be intercepted to change the <br />
security descriptor associated with the process, in order for <br />
its credentials to be refreshed. If this is <br />
done the threat of root setuid, discussed above, is also eliminated.<br />
<br />
==== Forwarding credentials ==== <br />
<br />
When an unmodified non-GSS server running on a Lustre <br />
client exports filesystem information, there may be no facility <br />
for the server system to have <br />
access to credentials for the user. At the minimum, the principal <br />
would have to log in to the server <br />
system and provide authentication information; then the PAG system <br />
would have to be bypassed <br />
to allow the user's credentials to become available to the server's PAG. AFS has <br />
recognized this as a serious usability issue. <br />
<br />
In order to not render Lustre unusable in this environment, a server-resident capability can be <br />
associated with the triple: client, fileset, and principal. This capability will allow the client to <br />
forward user id, group id, and setgroups arrays. <br />
<br />
Extreme caution is required, and by default no client, fileset, <br />
and principal triple has this capability.<br />
<br />
== Cross Realm Authentication and Authorization ==<br />
<br />
In global filesystems such as Lustre, filesets can be imported from different realms. The authentication problems associated with this are suitably solved by systems such as Kerberos. <br />
<br />
A fundamental problem arises from the clash of the user ''id'' / group ''id'' namespaces used in the <br />
different realms. These problems are present in different forms on clients, <br />
where remote user id’s <br />
need to be translated to sensible names in the absence of a UNIX API to do so. On servers, adding <br />
a remote principal to an access control list or assigning ownership of a file object to a remote <br />
principal requires the creation of a user ''id'' associated with that principal. <br />
<br />
Lustre will address both problems transparently to users through the creation of local accounts. <br />
It will also have fileset options to not translate remote user id’s, translate them lazily, or translate <br />
them synchronously to accommodate various use patterns of the filesystem. <br />
<br />
=== The Fundamental Problem in Cross Realm Authorization === <br />
<br />
File ownership in UNIX is in terms of ''uid''’s and ''gid''’s. File ownership on UNIX in a cross realm environment has two fundamental issues: <br />
<br />
# Clients need to find a textual representation of a user id. <br />
# Servers need to store a ''uid'' as owner of an inode, even when they only have realm and remote user ''id'' available.<br />
<br />
A utility invocation such as ''ls -l file'' issues a stat call to the kernel to retrieve the owner and group owner, and <br />
then uses the C library to issue a getpwent call to retrieve the textual representation of the user id. <br />
<br />
The problem with this is that while the Lustre filesystem may have knowledge that the user name <br />
should be retrieved from a user database in a remote realm, the UNIX API has no mechanism to <br />
transfer this information to the application. <br />
<br />
This is in contrast with the Windows API where files and users are identified by '''SID’s''' which lie in <br />
a much larger namespace and which are endowed with a lookup function that can cross Windows <br />
domains (the function name to do so is ''lookupaccountsid''). <br />
<br />
When the filesystem spans multiple administrative domains, the Unix API’s are not suitable to <br />
correctly identify a user. <br />
<br />
A server cannot really make a remote user an owner/group owner of a file nor can it make ACL <br />
entries for such users, unless it can represent the remote user correctly in terms of the available user <br />
and group databases.<br />
<br />
=== Lustre handling of remote user id’s === <br />
<br />
When a connection is made to a Lustre metadata server the key question that arises is: <br />
<br />
:Is the user / group database on the client identical to that on the server? <br />
<br />
We call such a client ''local'' with respect to the servers. Lustre makes that decision as follows: <br />
<br />
# The acceptor, used to accept TCP clients, has a list of local networks. Clients initiating connections from a local network will be marked as local. <br />
# There is a per-fileset default that overrides when the TCP decision is not present. This decision may not be present when clients on other networks connect.<br />
<br />
Each Lustre system, client and server, should have an account ''nllu'' (“non-local Lustre user”) installed. <br />
On the client this is made known to the filesystem as a mount option; on the server it is a similar <br />
startup option, part of the configuration log. On the client <br />
it is important that there is a name <br />
associated with the ''nllu'' user id, to make listings look attractive. <br />
<br />
When the client connects and authenticates a user, <br />
it presents the client’s uid for this user to the <br />
server. The client also presents the Kerberos identity of the user <br />
to the server, and this is used <br />
by the server to establish the server uid of the principal. <br />
For each client, the server has a list of <br />
authenticated principals. <br />
<br />
When the server handles a non-local client, <br />
it proceeds as follows for each uid that the server wants <br />
to transfer to the client or vice versa: <br />
<br />
# If the uid is handled by the server, and it is among the list of authenticated user id’s, translate it. <br />
# All other uids are translated to the server's or client's ''nllu'' user id.<br />
<br />
=== Limited manipulation of access control lists on non-local clients ===<br />
<br />
In order to provide an interface to ACL’s from non local clients, <br />
group and user names must be given as text, <br />
for processing on the server. Lustre’s lfs command will provide <br />
an interface to list and set ACL’s. <br />
However, the normal system calls to change ACL’s are not available <br />
for remote manipulation of <br />
ACL’s. <br />
<br />
=== Solutions in Other Filesystems ===<br />
<blockquote><br />
'''AFS:''' We believe there is no work around for the ''getpwent'' issues <br />
in the AFS client filesystem.<br> <br />
The Andrew filesystem has a work-around for the fundamental problem <br />
on the server <br />
side. When users gain privileges in remote cells that require them <br />
to appear as owners of <br />
files or in access control lists, the ''cklog'' program can be used to create an entry in the <br />
FileSystem Protection DataBase ('''FSPDB''') recording <br />
user id’s and group membership of <br />
the remote realm. The file server can now set owner, group owner, <br />
and ACL entries for <br />
the remote user correctly.<br />
<br><br />
This creation is required only once, but allows the remote cell to treat a cross-realm user <br />
in an identical fashion as a local user. For details, <br />
see the AFS documentation [12]. <br />
<br><br />
'''Windows:''' The Windows filesystem stores user identities in a much larger field than a 32 <br />
bit integer and the fundamental problem does not exist in Windows. <br />
The Win32 function <br />
''lookupaccountsid'' maps a security ''id'' to full information about the user, <br />
including the <br />
domain from where the ''sid'' originated. <br />
File owners are stored as ''sid''’s on the disk. <br />
<br><br />
'''NFS v4:''' This filesystem appears not to explicitly address this problem. <br />
NFS v4 transfers <br />
the file and group owners of inodes to the clients in terms of a string. <br />
On the whole this is <br />
a bad idea for scalability, as it forces the server to make numerous <br />
lookups of such names <br />
from userid’s, even when such data is not necessarily going to be used. <br />
<br><br />
If it is desirable to give clients textual information about users, <br />
they should probably <br />
interact with the user databases themselves to avoid generating a server bottleneck.<br />
</blockquote><br />
<br />
== MDS Authorization: Access Control Lists ==<br />
<br />
Our desire is to implement authorization through access control lists. <br />
The lists must give Linux <br />
Lustre users POSIX ACL semantics. Given that we handle <br />
cross realm users through the creation <br />
of local accounts for those users, we can rely on the POSIX ACL mechanisms. Lustre will use <br />
existing ACL mechanisms available in the Linux kernel and filesystems to authorize access. <br />
<br />
This is the same mechanism used by Samba and NFSv4. <br />
<br />
Good but not perfect compatibility has been established <br />
between CIFS, NFSv4, and POSIX ACL’s. <br />
The subtle semantic differences between Windows, NFSv4, <br />
and POSIX ACL’s can be further refined by adding such <br />
ACL handling to the filesystems supported by Lustre. <br />
<br />
A secondary and separate “access” control list may be added to <br />
filesets that have enabled file encryption. <br />
This ACL will be handled separately after the POSIX ACL <br />
has granted access to the inode. <br />
<br />
=== Fid guessing === <br />
<br />
During pathname traversal the client goes to the parent on the MDS, <br />
passing ACL, mode-bit, and other checks to get its lookup authorized. <br />
When complete, the client is <br />
given the FID of the object it is looking for. <br />
If permissions on a parent of the fid change, a client <br />
may not be able to repeat this directory traversal. <br />
A well-behaved client will drop the cached fid it <br />
obtained when it sees permission changes on any parents. <br />
To do so it uses the directory cache on <br />
the client. <br />
<br />
A fid guessing attack consists of a rogue client <br />
re-using a fid previously obtained or obtained <br />
through guessing in order to start pathname traversal <br />
halfway through a pathname, at the location <br />
of the guessed fid. Protecting against access to <br />
MDS inodes through "fid guessing" is important in <br />
the case of restrictive permissions on a parent, <br />
and less strict permissions underneath. <br />
<br />
To prevent this, the Lustre MDS generates a capability during <br />
lookup which allows the fid to be <br />
re-used for a short time upon presentation of the capability. <br />
Any fid-based operation will fail <br />
unless the fid cookie is provided. This limits the exposure <br />
to rogue clients to a short interval, of <br />
which users should preferably be aware. <br />
<br />
==== Alternatives==== <br />
NFS has made file handles "obscure" to achieve the same.<br />
<br />
=== Implementation Details===<br />
<br />
A fundamental observation about access control lists is that <br />
typically there are a few access control lists per file owner, <br />
but thousands of files and directories <br />
with that owner. As a result it is not efficient, <br />
though widespread practice, to store a copy of the <br />
ACL’s with each inode.<br />
<br />
The Ext3 filesystem has implemented ACL’s with an indirection scheme. <br />
We leverage that scheme <br />
on the server, but not yet on the client.<br />
<br />
== Auditing ==<br />
<br />
Lustre uses a filter layer called smfs which can intercept <br />
all events happening in a filesystem and <br />
on OST’s. <br />
<br />
Auditing happens on all systems. Auditing on clients is necessary to record access to cached <br />
information which only the client filesystem can intercept at reasonable granularity; operations <br />
that result in RPC’s are not cached for efficiency reasons. <br />
On the MDS systems, audit logs are <br />
perhaps the most important since they contain the first point of <br />
access to the file and directories. <br />
On the OSS’s a summary audit log can be written, <br />
with a reference to the entry on the MDS that <br />
needs to be looked at in conjunction with this. <br />
For this the objects on the OSS carry a copy of the <br />
FID of the MDS inode. <br />
<br />
Lustre will send this information to the syslog daemon. <br />
The granularity of the information logged <br />
will be tunable. A tool is available to combine the information <br />
obtained from servers and clients <br />
and to scan for anomalies. <br />
<br />
A critical piece of information that needs to be <br />
logged on the OSS is the full file identifier of the <br />
MDS inode belonging with an object. Moreover, <br />
file inodes on the MDS should contain a pointer to <br />
parent directories to produce traceable pathnames. <br />
<br />
<br />
=== Alternatives === <br />
Such mechanisms are described in Howard Gobioff’s thesis [XXX] section 4.4.3. <br />
<br />
== SFS Style Encryption of File Data ==<br />
<br />
The StorageTek SFS filesystem provides a very interesting way <br />
to store file data encrypted on disks, <br />
while enabling sharing of the data between organizations. <br />
SFS is briefly described in ['''13'''] and ['''14''']. <br />
In this subsection, we review some of the SFS design <br />
and propose an integration with Lustre. We also <br />
provide a more lightweight cryptographic file system capability <br />
that is much easier to implement. <br />
<br />
=== Encrypted File Data=== <br />
In SFS, file data can be encrypted. Each file has a unique <br />
random key, which is created at the time the file is created. <br />
It is stored with the file, but it is <br />
encrypted, and a third-party agent <br />
called the group server must be accessed to provide the unencrypted <br />
file key. The key never changes, and remains attached to <br />
the file until the inode of the file is freed. <br />
<br />
The file is encrypted with a technique called ''countermode'', <br />
see ['''15'''], ''section 2.4.5''. Countermode is <br />
a simple mechanism to encrypt an arbitrary extent <br />
in a file without overhead related to the offset at <br />
which the extent is located. <br />
<br />
Ultimately this cryptographic information leads to <br />
a bit stream which is XOR’d with the <br />
file data. Patches probably exist for Linux kernels <br />
to introduce counter mode encryption of files <br />
relatively easily.<br />
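The property that makes counter mode attractive here is that any extent can be encrypted or decrypted independently of its offset. A minimal sketch, using a SHA-256-based keystream as a stand-in for a real block cipher in counter mode (AES-CTR in practice; the function names are invented for the example):

```python
import hashlib

BLOCK = 16  # keystream block size in bytes

def keystream(key: bytes, offset: int, length: int) -> bytes:
    # Generate the keystream bytes covering [offset, offset+length) by
    # hashing (key, block counter) per block -- positional, so any
    # extent can be handled without touching earlier data.
    out = bytearray()
    block = offset // BLOCK
    skip = offset % BLOCK
    while len(out) < skip + length:
        out += hashlib.sha256(key + block.to_bytes(8, "big")).digest()[:BLOCK]
        block += 1
    return bytes(out[skip:skip + length])

def crypt_extent(key: bytes, offset: int, data: bytes) -> bytes:
    # XOR with the keystream; the same call encrypts and decrypts.
    ks = keystream(key, offset, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))
```

Note that encrypting a sub-extent at its own offset yields exactly the corresponding slice of the full ciphertext, which is the "no overhead related to the offset" property the text refers to.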
<br />
=== Creating a New File === <br />
An information producer creates a new file and can define who <br />
can share this file. At the time of file creation the file is <br />
encrypted with a random key, and an access <br />
control list for the file is generated, <br />
granting access to the file. The group server is involved for two <br />
reasons: <br />
<br />
# It encrypts the file key with a group server key. <br />
# It signs the access control list, including the key, so that its integrity remains known. <br />
<br />
The encrypted file key and the signed access <br />
control list are stored with the file.<br />
<br />
=== The SFS Access Control List === <br />
SFS defines an '''access control list''', which is perhaps <br />
an unfortunate term because it is more a sharing control description. <br />
We call the SFS access control <br />
list the SFS control list.<br><br><br />
The SFS control list contains identity descriptors <br />
which contain a name of a '''group''' (confusingly <br />
called a '''project''' in the SFS literature) and the file key <br />
encrypted with a public key of a group server. <br />
Once an application has access to the inode, <br />
it can scan the SFS control list and present an identity <br />
to a group server, which then returns the key to the file. <br />
This description, taken from the SFS papers, <br />
fails to address the issue of integrity of the ACL, <br />
for which some measures must be taken. <br />
<br />
A variety of more complicated identities can be added to <br />
the SFS control list. Escrow can be <br />
defined by entries that state that any K of N identities <br />
must be presented to the group server before <br />
the key will be released. There is also a mechanism for <br />
an identity to be recursive with respect to <br />
group servers and require more than one group server to <br />
decrypt before the key is presented. <br />
<br />
In principle anyone who can modify the SFS control <br />
list of the file can add further entries defining <br />
groups managed by other group servers, <br />
by encrypting such entries with the public key of the <br />
group server, provided the group server permits this operation. <br />
<br />
==== The Group Services ==== <br />
The user, or the filesystem on behalf of the user, presents an <br />
identity found in the access control list and <br />
the user credentials to a group server. The group server <br />
checks that the user is a member of the group and <br />
returns the unencrypted key to the filesystem to <br />
allow it to decrypt the file. <br />
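The key-release step can be sketched as below. This is a deliberately simplified toy: XOR key wrapping stands in for encryption under the group server's key, and the names (`wrap`, `release_key`, `GROUP_MASTER_KEY`) are assumptions for the example, not SFS or Lustre API.

```python
import hashlib

GROUP_MASTER_KEY = b"group-server-master-key"  # held only by the group server

def wrap(file_key: bytes) -> bytes:
    # Toy key wrapping: XOR with a digest of the master key.
    # (XOR wrapping is its own inverse; real SFS uses public-key crypto.)
    pad = hashlib.sha256(GROUP_MASTER_KEY).digest()[:len(file_key)]
    return bytes(a ^ b for a, b in zip(file_key, pad))

def release_key(groups: dict, group: str, user: str,
                wrapped_key: bytes) -> bytes:
    # The group server checks membership before unwrapping the file key;
    # this is also where an audit trail of key releases can be built.
    if user not in groups.get(group, set()):
        raise PermissionError(f"{user} is not a member of {group}")
    return wrap(wrapped_key)
```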
The group server can build an audit trail of access to files. <br />
<br />
The group server must be trusted since it can generate keys <br />
to all files that have an ACL entry <br />
encrypted with the public key of the group server. <br />
<br />
The group server acts a bit like a KDC, but it distributes file keys, <br />
not session keys. <br />
<br />
Some aspects of the group service are the subject of <br />
a patent application filed by StorageTek. <br />
<br />
====Weaknesses Noted ====<br />
<blockquote><br />
'''Counter mode encryption:''' This technique has some weaknesses, <br />
called ''malleability'', but adding '''mixing''' can fix this. <br />
Mixing algorithms are being worked on but will be patented. <br />
[see Rogaway at UCSC.] <br />
<br><br><br />
'''Access control:''' The SFS access control lists have, <br />
at least theoretically, a weakness. While <br />
it is debatable if the system actually gives the key to a user, <br />
once the key has been given <br />
out to a user the user may retain access to the file data permanently. <br />
For a database file <br />
which remains in existence permanently, this is not an optimal situation. <br />
<br><br />
Ordinary access control lists need to supplement the authorization. <br />
This will prevent <br />
unauthorized access to the file. However, a user with <br />
a key remains a more risky individual with respect to theft of the encrypted data.<br />
</blockquote><br />
<br />
=== Lustre SFS === <br />
Lustre provides hooks for a client node to invoke the services of the <br />
group key service as proposed by SFS. The SFS access control <br />
list will be stored in an extended <br />
attribute, in '''addition to normal ACL’s''' discussed above. <br />
A key feature of this group server is that <br />
principals can manipulate the database, <br />
in contrast with system group databases, which usually <br />
allow only root to make any modifications.<br />
<br />
Lustre also implements a simpler encryption scheme where <br />
the group key service runs in the MDS <br />
nodes. This scheme uses the normal ACL with an extended attribute <br />
to store the encrypted file <br />
encryption key. The MDS has access to the group server key, <br />
and provides the client with the <br />
unencrypted key after authorization for file read <br />
or write succeeds, based on the normal POSIX <br />
ACL. Lustre also has a server option on principals <br />
that allows decryption on certain client nodes, <br />
regardless of the ACL contents. It is recommended <br />
that the acquisition of credentials for such operations <br />
follow extremely secure authentication, <br />
such as multiple principals using specially crafted <br />
frontends to the GSS security daemons. <br />
<br />
<br />
=== Controlling encryption === <br />
An MDS target can have a setting to have none, all, or part <br />
of the files encrypted. When part of the files is encrypted, <br />
the ''lfs'' user utility can mark a directory subtree <br />
for encryption.<br />
<br />
=== Encrypting Directory Operations === <br />
Encrypting directory data is a major challenge for <br />
filesystems. It appears possible to use a scheme like <br />
the SFS scheme to encrypt directory names. <br />
MDS directory inodes can hold an encrypted data encryption key <br />
that is used to encrypt & decrypt <br />
each entry in the directory. <br />
<br />
Clients encrypt names so that the server can perform lookup <br />
on encrypted entries. The client <br />
receives encrypted directory entries, and for directory listings <br />
the client performs decryption of <br />
the content of the directory.<br />
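For the server to look up an encrypted entry by comparison, the name encryption must be deterministic: the same (directory key, name) pair must always produce the same ciphertext. A toy sketch under that assumption (the helper names are invented; note a deterministic scheme like this leaks name equality and a real design would need a proper deterministic cipher):

```python
import hashlib

def _pad(dir_key: bytes, n: int) -> bytes:
    # Keystream depending only on the directory key and the name length,
    # so encryption is deterministic and self-inverse.
    out = hashlib.sha256(dir_key + n.to_bytes(4, "big")).digest()
    while len(out) < n:
        out += hashlib.sha256(dir_key + out).digest()
    return out[:n]

def encrypt_name(dir_key: bytes, name: str) -> bytes:
    raw = name.encode()
    return bytes(a ^ b for a, b in zip(raw, _pad(dir_key, len(raw))))

def decrypt_name(dir_key: bytes, enc: bytes) -> str:
    # Same XOR; the client uses this when listing a directory.
    return bytes(a ^ b for a, b in zip(enc, _pad(dir_key, len(enc)))).decode()

def server_lookup(entries: dict, enc_name: bytes):
    # The MDS compares encrypted names directly; it never sees plaintext.
    return entries.get(enc_name)
```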
<br />
== File I/O authorization ==<br />
<br />
=== Capabilities to access objects === <br />
The clients request the OSS to perform create/read/write/truncate/delete <br />
operations on objects. Truncate can probably be treated as write, <br />
particularly because Lustre already has append-only <br />
inode flags to protect files from truncation. <br />
The goal is to authorize these operations efficiently and securely. <br />
This section contains the design for this functionality.<br />
<br />
When a client wishes to perform an operation on an object <br />
it has to present a capability. The <br />
capability is prepared by the MDS when a file is opened <br />
and sent to a client. Properties of the <br />
capabilities are: <br />
<br />
# They are signed with a key shared by the MDS and OSS. <br />
# They possibly specify a version of an object for which the capability is valid.<br />
# They specify the fid for which objects may be accessed.<br />
# They specify what operations are authorized.<br />
# A validity time period is specified, assuming coarsely synchronized clocks between the MDS and OSS.<br />
# The kerberos principal for which the capability is issued is included in the capability.<br />
<br />
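The capability properties listed above can be sketched with an HMAC over the capability body, keyed with the MDS/OSS shared key. This is an illustrative sketch, not Lustre's wire format; the field layout and function names are assumptions.

```python
import hmac
import hashlib

# Assumed pre-shared between MDS and OSS (property 1 above).
SHARED_KEY = b"mds-oss-shared-key"

def mds_make_capability(fid: int, ops: str, principal: str,
                        expiry: float) -> dict:
    # The MDS builds and signs the capability at file open time.
    body = f"{fid}:{ops}:{principal}:{expiry}".encode()
    sig = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return {"fid": fid, "ops": ops, "principal": principal,
            "expiry": expiry, "sig": sig}

def oss_verify(cap: dict, fid: int, op: str, now: float) -> bool:
    # The OSS recomputes the signature, then checks fid, requested
    # operation, and the validity window (coarsely synchronized clocks).
    body = f"{cap['fid']}:{cap['ops']}:{cap['principal']}:{cap['expiry']}".encode()
    expect = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expect, cap["sig"])
            and cap["fid"] == fid
            and op in cap["ops"]
            and now < cap["expiry"])
```

Because only the MDS and OSS hold the shared key, a client cannot forge or alter a capability without invalidating the signature.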
=== Network is secure/insecure === <br />
If the network is secure, capabilities cannot be <br />
snooped off the wire, so no network encryption is needed. However, <br />
normally capabilities have to <br />
be transmitted in an encrypted form between the MDS and <br />
the client and between the client and <br />
the OSS to avoid stealing the capability off the wire.<br />
<br />
GSS can be used for that. If GSS authenticates each user to the OSS, a particularly strong scenario <br />
is reached. <br />
<br />
=== Multiple principals === <br />
If a single client performs I/O for multiple users, <br />
the client Lustre software establishes capabilities <br />
for each principal through MDS open. Ultimately the I/O <br />
hinges on a single capability still being valid. <br />
<br />
<br />
=== Revocability and trusted software on client === <br />
If a malicious user is detected, all <br />
OSS’s can refuse access through a “blacklist”. This leads to immediate revocability. <br />
<br />
If client software is trusted, clients will discard cached capabilities associated with files when <br />
permissions change, for example. Cached capabilities only exist <br />
if cache of open file handles is <br />
used. If software on clients cannot be trusted, <br />
a client may regain access to the file data as long as <br />
his credentials are valid. <br />
<br />
This could be refined by immediately expiring capabilities on the OSS’s, <br />
by propagating an object <br />
version number to the OSS’s and including it in the capability. <br />
This would slow down setattr <br />
operations, but increase security. <br />
<br />
=== Corner cases ===<br />
==== Cache flushes ==== <br />
Cache flushes can happen after a file is closed. <br />
If file inode capability cookies are replicated to objects, <br />
this can lead to problems, because a cache flush could <br />
encounter a -EWRONGCOOKIE error, but no open file handle is available to re-authorize the I/O. <br />
If cookies are replicated, the data needs to be flushed <br />
when the file is closed; postponing closes has <br />
proven to be very hard. <br />
<br />
==== Replay of open after recovery ==== <br />
If a file’s permissions were reduced while it was <br />
open, replay of open involves trusting clients to replay honestly, <br />
or including a signed capability to the client <br />
to replay open with pre-authorized access on the MDS. <br />
At present Lustre checks <br />
permission on replay again, so open replay may not be transparent <br />
and may cause client eviction. <br />
<br />
==== Client open file cache ==== <br />
With the client open cache, reauthorization after the initial <br />
open is possible but somewhat pointless. If the client software <br />
cannot be trusted, data could be <br />
shared between processes on the client anyway. <br />
Lustre uses the client to re-authorize opens from <br />
the open cache. <br />
<br />
==== Write back cache ==== <br />
With the write back cache, a client should be authorized to create <br />
inodes with objects and set initial cookies on the objects it creates. <br />
For the master OSS where the <br />
objects will finally go, such authorization should involve an MDS-granted capability; for the cache <br />
OSS, the client can manage security. <br />
<br />
==== Pools ==== <br />
Pools have a security parameter attached to them to authorize clients in a <br />
certain network to perform read, modify, create, <br />
delete operations on objects on a certain OST. This <br />
authorization is done as part of file open, create and unlink. <br />
The MDS will not grant capabilities to <br />
perform operations on objects not allowed by the pool descriptor.<br />
<br />
== Odds and Ends == <br />
<br />
=== Recovery and the security server=== <br />
The security server provides GSS/kerberos (or <br />
other GSS services) and networked user/group <br />
database services to Lustre. This is 3rd party software <br />
and Lustre has not planned modifications to it <br />
to become failover proof. The following details <br />
the situation further:<br />
<br />
The software will consist of: <br />
<br />
# LDAP services; here the clients are the C library queries to that database, partly through PAM modules and utilities.<br />
# the kerberos KDC; the clients are the client and server GSS daemons and kerberos utilities like kinit, their library equivalents, and PAM modules <br />
<br />
The server parts of 1 and 2 can easily be made redundant <br />
as standard IP services. For 1 and 2, client-server <br />
protocol failure recovery would consist of retries <br />
and transaction recovery code for the services. <br />
This recovery, for the protocols in 1 and 2, <br />
would be completely outside the scope of Lustre. It just <br />
_may_ exist already, but I doubt these protocols <br />
have good retry capabilities. <br />
<br />
However, if Lustre components (MDS, OSS, and clients) fail <br />
and recover, they will re-use these <br />
services appropriately to recover. In some cases Lustre’s <br />
retry mechanisms may, by coincidence, <br />
invoke the appropriate retry on protocols 1 and 2.<br />
<br />
=== Renewing Session Keys === <br />
Long running jobs need to renew <br />
their session keys. Lustre will contain sufficient error <br />
handling to refresh credentials from the user level GSS daemons <br />
transparently.<br />
<br />
=== Portability to Less Powerful Operating Systems === <br />
When Lustre is running as a library on a system which <br />
may not have access to IP services, some restrictions <br />
in the security model <br />
are required. For example a GSS security backend running <br />
on a service node operating the job dispatcher should supply <br />
a context that can be used by all client systems. <br />
[XXX: Is this the Red Storm <br />
model? It does not fit BG/L.]<br />
<br />
Every effort will be made to implement a single <br />
security infrastructure and treat such special cases <br />
as policies. <br />
<br />
== Summary Tables == <br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|attribute || structure system || containing data structure || notes <br />
|-<br />
|pag || client || current process & user level GSS daemon || a number belonging to a process group, authenticated process groups <br />
|-<br />
|ignore pag || client || super block || do not use pags, but use uid’s to get <br />
|-<br />
|client remote || client || super block ||treat this client as one in a remote domain<br />
|-<br />
|gss context || client/MDS/OSS || associated with a principal and service export || a list of GSS supplied contexts associated with the export/import on servers/clients <br />
|-<br />
|LSD || MDS || associated with a principal and service export || <br />
describes attributes and policies for a particular principal using a file system target. <br />
|-<br />
|security policy handler || MDS/OSS || lustre network interface || <br />
methods describing the server security in effect on this network <br />
|-<br />
|root squash || MDS || MDS target descriptor || root identity is mapped one-way on the server<br />
|-<br />
|target allow setid || MDS || MDS target descriptor || grant certain clients/principals setid capability <br />
|- <br />
|modify cookie || MDS || MDS target descriptor || when this flag is set, <br />
changes to mode bits or owners will cause the version to be modified on associated MDS and OSS inodes. <br />
|- <br />
|principal allow setid || MDS || LSD || allow principal setid <br />
|- <br />
|principal setids || MDS || LSD || which id values can be set by a principal (ANY is a permitted value) <br />
|- <br />
|client allow setid || MDS || LSD || when a principal is found that has setid <br />
the client list is given to the MDS <br />
|- <br />
|local or remote domain|| MDS || lustre connection || <br />
is this export to a local or remote client, given to<br />
the kernel by the acceptor upon connect, or else<br />
set as a configured per server default associated<br />
with the network. A client may override and<br />
request to be remote during connection. <br />
|- <br />
|client uid-gid/server uid-gid || MDS || LSD || client and server uids/gids associated with the security context, used for remote clients <br />
|- <br />
|Lustre unknown uid/gid || client/MDS || superblock/LSD || unknown user id to be used with this principal<br />
when translating server inode owners/client owners/groups; used only with remote<br />
clients. <br />
|- <br />
|groups array || MDS || LSD || server cached group membership array <br />
|- <br />
|mds directory inode cookie || MDS || audit mask || random number to authorize fid based access to MDS inodes <br />
|- <br />
|inode crypt key || MDS || mds inodes || random file data encryption key, encrypted with the group server key<br />
|- <br />
|parent fid || MDS || mds inodes || the fid of a parent inode of a file inode, for <br />
pathname reconstruction in audit logs <br />
|- <br />
|file inode cookies || MDS/OSS || MDS inodes/oss objects || random numbers enabling authorization of I/O operations <br />
|- <br />
|file crypt master key || LGKS || LGKS memory || <br />
key to decrypt file encryption keys <br />
|- <br />
|MDS audit mask || MDS || target smfs filter file system || mask to describe what should be logged <br />
|- <br />
|Client audit mask || client filter || super block of filter || mask to describe what should be logged <br />
|- <br />
|OSS audit mask || OSS || target smfs filter file system || mask to describe what should be logged<br />
|}<br />
<br />
<br />
=== Data structures and variables === <br />
<br />
=== Client configuration options === <br />
<br />
:'''remote lustre user id:''' mount option ''rluid''=<''int''> <br />
<br />
:'''remote lustre group id:''' mount option ''rlgid''=<''int''> <br />
<br />
:'''client is remote:''' mount option ''remote'' <br />
<br />
:'''don’t use pag:''' mount option ''nopag''<br />
<br />
=== MDS configuration options ===<br />
{| border=1 cellspacing=0<br />
|-<br />
|feature || description <br />
|-<br />
|network secure || configuration option: a network with a given <netid> is secure: <netid> secpol=<open|GSS|Integrity|Encrypt> <br />
|-<br />
|principal db: || <br />
allow setid <principal> <allowed setuids> <allowed setgids> <allowed targets> <br />
<allowed clients> <br />
|-<br />
| ||allow decrypt <principal> <allowed targets> <allowed clients> <br />
|-<br />
| ||allow fid access <principal> <allowed targets> <allowed clients> <br />
|-<br />
|target security parameters || no root squash <allowed targets> < allowed clients> <br />
|-<br />
| pool security parameters || <pool name> <client netid> <c|m|r|u> <br />
|-<br />
|audit || audit <audit options> <what target> <what clients> <br />
|-<br />
|encrypt || encrypt <all|none|partial> <what target> <br />
|-<br />
|file encryption master key || <key> <br />
|}<br />
<br />
<br />
The client descriptors are lists of ''netid'' and lists of ''netid,nid,pid'' triples.<br />
<br />
=== OSS configuration options === <br />
The OSS needs a principal DB to grant the MDS and <br />
certain administrative users raw object access without cookies. <br />
<br />
:'''principal db''': allow raw <principal> <allowed targets> <allowed clients/mds><br />
<br />
=== Extended attributes === <br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|What || System || data structure || description<br />
|-<br />
|dir cookie || MDS || MDS dir inode || random 64-bit number to avoid fid guessing attacks<br />
|-<br />
|file read cookie || MDS/OSS || MDS file inodes & objects || Part of the capability to authorize I/O<br />
|-<br />
|crypt key || MDS || encrypted MDS inodes || encrypted inode encryption key<br />
|-<br />
|parent inode pointer || MDS || all MDS inodes || pointer to a directory inode and the MDS containing this inode<br />
|-<br />
|crypt subtree || MDS || MDS directory inodes || all file inodes under this directory should be encrypted<br />
|}<br />
<br />
== Changelog ==<br />
'''Version 4.0 (Sep 2005)''' Peter J. Braam, update for security deliverable.<br />
<br />
'''Version 3.0 (Aug 2003)''' Peter J. Braam, rewritten after security CDR<br />
<br />
'''Version 2.0 (Dec. 2002)''' P.D. Innes -updated figures and text, added Changelog<br />
<br />
'''Version 1.0''' P. Braam -original draft<br />
<br />
== References ==<br />
<br />
[[Category:Architecture|Security]]</div>
Clustered Metadata - 2007-09-28T09:46:27Z<p>Lydia: /* Clustered MDS protocol. */</p>
<hr />
<div>== Clustering Metadata ==<br />
In order to provide enhanced scalability and performance, Lustre offers clustered metadata servers. This section will give an outline of the architecture.<br />
<br />
The main challenge we face is to provide a substantial gain in scalability of the metadata performance of Lustre through great parallelism of common operations. This involves finding mechanisms which distribute operations evenly over the metadata cluster, while avoiding a more complex protocol involving further RPC’s. The current trend in distributed file system design is to do such clustering by allowing clients to pre-compute the location of the correct services. <br />
<br />
A second challenge is to provide good load balancing and resource allocation properties both for large installations where the metadata cluster acts in effect as a metadata server and in the case of small clusters in which the metadata cluster itself will access metadata on other nodes in the cluster. <br />
<br />
Our architecture accomplishes this by heavily leveraging existing building bricks, primarily existing file systems and their metadata interfaces. <br />
Finally the key challenge is to provide good scalability and simple recovery within the metadata cluster itself. <br />
<br />
=== Summary of metadata clustering configurations.===<br />
Overall the clustered metadata handling is structured as follows. <br />
<br />
* A cluster of metadata servers manages a collection of inode groups. Each inode group is a Lustre device exporting the usual metadata API, augmented with a few operations specifically crafted for metadata clustering. <br />
* Directory formats for file systems used on the MDS devices are changed to allow directory entries to contain an inode group and an identifier of the inode. <br />
* A logical metadata volume (LMV) driver is introduced below the client Lustre file system write back cache driver that maintains connections with the MDS servers. <br />
* There is a single metadata protocol that is used by the client file system to make updates on the MDS’s and by the MDS’s to make updates involving other MDS’s. <br />
* There is a single recovery protocol that is used by the client-MDS and MDS-MDS services. <br />
* Directories can be split across multiple MDS nodes. In that case a primary MDS directory inode contains an extended attribute that points at other MDS inodes which we call directory objects. <br />
<br />
==== Modular design.====<br />
Client systems will have the write back client (WBD) or client file system communicate directly with the LMV driver: it offers the metadata API to the file system and uses the metadata API offered by a collection of MDC drivers. Each MDC driver manages the metadata traffic to one MDS. The function of the LMV is very simple: it figures out from the command issued which MDC to use. This is based on:<br />
<blockquote><br />
(1) <br />
the inode groups in the request <br />
<br><br />
(2) <br />
a hash value of names used in the request, combined with the EA of a primary inode involved in the request. <br />
<br><br />
(3) <br />
for readdir the directory offset combined with the EA of the primary inode <br />
<br><br />
(4) <br />
the clustering descriptor <br />
<br><br />
</blockquote><br />
In any case every command is dispatched to a single metadata server; the clients will not engage more than one metadata server for a single request. The API changes here are minimal and the client part of the implementation is very simple.<br />
<br />
==== Basics of the operations.====<br />
For the most part, operations are extremely similar or identical to what they were before. In some cases multiple MDS servers are involved in updates. Getattr, open, readdir, setattr and lookup methods are unaffected. Methods adding entries to directories are modified in some cases: <br />
<blockquote><br />
(1) <br />
'''mkdir''' always creates the new directory on another MDS <br />
<br><br />
(2) <br />
'''unlink, rmdir, rename''': may involve more than one MDS <br />
<br><br />
(3) <br />
'''large directories''': all operations making updates to directories can cause a directory split. The directory split is discussed below. <br />
<br><br />
(4) <br />
'''other operations''': if no splits or large directories are encountered, all other operations proceed as they are executed on one MDS.<br />
</blockquote><br />
<br />
==== Directory Split.==== <br />
A directory that is growing larger will be split. There is a fairly heavy penalty associated with splitting the directory and also with renames within split directories. Moreover, at the point of splitting, inodes become remote and will incur a penalty upon unlink. <br />
<br />
Probably it is best to delay the split until the directory is fairly large, and then to split over several nodes, to avoid further splits being necessary soon afterwards. <br />
==== Locking.==== <br />
Locking can be done in fid order as it is currently done on the MDS. In order to obtain cluster-wide ordering of resources, clients must choose the correct coordinating MDS, so that locks taken there initiate the lock ordering sequence to be followed. This is particularly important for rename, which has to be started at the target or source directory, depending on which holds the highest-order resource. <br />
=== Resources.=== <br />
The MDS handles the persistent storage of metadata objects and directory data. Internal to the metadata service is a large amount of allocation management. <br />
<br />
The use of resources is easily summarized as follows: <br />
<blockquote><br />
;'''Names: ''':<br />
(1) <br />
Look up the name in a directory <br />
<br><br />
(2) <br />
insert / remove names in a directory <br />
<br><br />
<br />
;'''FID:''':<br />
(1) <br />
get attributes for a fid <br />
<br><br />
(2) <br />
create, remove the corresponding object <br />
</blockquote><br />
<br />
The ownership of resources varies among file systems. In local file systems a single node owns all resources. No parallelism can be achieved with this. In traditional clustering file systems, nodes own individual inodes or disk blocks. This leads to fine grained ownership of resources, but involves frequent collisions and poor locality of reference. <br />
<br />
For Lustre we propose that each node owns a moderately large group of objects. There would be a large shared storage pool, which would be subdivided into relatively small file systems, this is shown in figure 6.7.1. We call the small file systems an inode group. Each inode group has its own journal for recovery, is formatted as a file system and can fail-over to another node for availability or adjustment of resources. We will make the load on the inode groups evenly distributed through randomness. <br />
<br />
Clients will get a logical clustered metadata driver which exploits multiple MDC clients (see figure 6.7.2). Just like the logical object volume, the file system itself does not need to know the details of the object distribution; that can be left to a small logical metadata volume driver, invoked by the file system through the same API. The MDS system will get clustering and policy adaptations. The key to this is to add an '''inode group''' identifier to the fid; this marks the inode group to which an inode belongs. The resource database for the cluster will provide every client with a load balancing map which indicates on which MDS server a particular inode group is currently mounted. <br />
<br />
The resource location will be managed as follows: <br />
<blockquote><br />
'''File inodes: '''<br><br />
[[Image:Pages_from_logical_metadata_volume_driver.jpg]]<br />
<br />
* Create the file inode in the inode group of the directory inode holding the name <br />
'''Directory inodes:'''<br />
<br />
* Create in a new inode group <br />
* The policy on which group to pick could be round robin, random, most space available etc. Probably every MDS reply packet should contain some status information to give clients policy information. <br />
<br />
<br />
'''Directory data:'''<br />
<br />
* While the directory is small, keep it with the inode <br />
* When it grows, fan it out.<br />
</blockquote><br />
<br />
=== Clustered directories.===<br />
When directories grow we will split them up into '''directory data objects''' which are placed on multiple MDS servers; figure 6.7.3 shows this transition from a single directory to multiple directory objects. This is quite analogous to striped files, which are placed in data objects on multiple servers. <br />
<br />
[[Image:Transition_from.jpg]]<br />
<br />
Directory entries will hold an inode group identifier and an inode number, compared to traditional entries holding merely a name and inode number. So once a name is found in directory data, the inode group and the inode number within that group are known. <br />
<blockquote><br />
'''getattr_lock(parent_fid, name):''' To find the directory entry itself, the algorithm is similar to that of finding a file stripe. When a directory inode is located, it will either contain directory data, in which case it is treated as a traditional directory, or an extended attribute describing what inode buckets exist, specifying a fid for each bucket (its inode group, inode number and generation). A hash then maps the name to a particular bucket based on this metadata, and a normal name lookup in the bucket proceeds to find the entry. <br />
<br><br />
The worst case here is that this requires 3 RPCs: the first to do a getattr on the directory inode, which yields the extended attribute; the second to find the directory entry on the server holding the bucket; and the third to find the inode attributes in the inode group associated with the entry. However, the common case is that a single RPC is sufficient: normally the directory inode will already be cached, so the first RPC goes to the server containing the bucket, and usually the inode is located on that server and can be fetched in the same RPC. The number of disk reads is identical to, or one higher than, that for large non-clustered directories. <br />
</blockquote><br />
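The hashed bucket selection described above can be sketched as follows (hypothetical Python; the hash function and data layout are invented for illustration, not Lustre's actual ones):<br />

```python
# Hypothetical sketch of the lookup path in a clustered (fanned-out)
# directory: hash the name to pick a bucket fid from the master inode's
# extended attribute, then do a normal name lookup in that bucket.
import hashlib

def pick_bucket(name, bucket_fids):
    """Map a name to one of the bucket fids listed in the master inode's EA."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return bucket_fids[h % len(bucket_fids)]

# Each bucket fid: (inode group, inode number, generation) -- invented values.
buckets = [("group3", 101, 1), ("group7", 205, 1)]
print(pick_bucket("hello.c", buckets))
```

Because the hash is computed from the EA's bucket list, all clients deterministically agree on which bucket holds a given name.<br />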
The process of creating a clustered directory is triggered by the directory growing beyond a certain size. The split is performed as early as possible; there may be a small performance effect at the moment a directory is split, but aggregate performance will be good since operations can then proceed in parallel.<br />
<br />
=== Directory inodes and clustered metadata.===<br />
Directory inodes come in two variants: <br />
<br />
'''small directories:''' An ordinary directory inode in a single inode group. <br />
<br><br />
'''large directories:''' <br />
<blockquote><br />
'''master directory inode:''' with an EA pointing to the buckets in other inode groups <br><br />
'''bucket inodes:''' in other inode groups. Each bucket is associated with an inode that manages the space allocation for the bucket directory data. The bucket directory data covers a range of hash values and provides a map from name to (inode group, inode number), identifying the fid up to the generation number. <br />
</blockquote><br />
The fanout operation, triggered by a directory growing beyond a certain size, creates the buckets. This involves a new RPC in the MDS service that allows the creation of a remote bucket and its population with directory entries. <br><br />
<br />
This is a simple RPC that brings no complications to recovery, since the buckets are visible exclusively to the inode group of the master. It is possible for buckets to be orphaned, and this requires cleanup. <br />
<br />
Removal of a fanned-out directory is similar in complexity. Here it is important to use an MDS-to-MDS reconnect handshake, identical to the client-MDS handshake, between the server holding the master inode and the MDSs holding the inode groups containing the buckets, to handle the failure of MDS servers holding buckets that need to be removed. <br />
<br />
The security of such MDS-MDS interactions is probably most easily managed with a capability model similar to that used between the clients and the OSTs. <br />
<br />
The attributes of clustered directories are most easily managed in a distributed fashion as we do for the file data objects. <br />
<blockquote><br />
'''size:''' sum of all the bucket sizes<br />
<br><br />
'''link count:''' sum of all the bucket link counts<br />
<br><br />
'''mtime: '''latest of all mtimes<br />
</blockquote><br />
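A sketch of this distributed attribute computation (illustrative Python; in reality the per-bucket attributes live in the bucket inodes on their respective servers):<br />

```python
# Derive the attributes of a clustered directory from its buckets, as
# described above: sizes and link counts sum, mtimes take the latest.
def clustered_dir_attrs(bucket_attrs):
    """bucket_attrs: one dict per bucket with 'size', 'nlink', 'mtime'."""
    return {
        "size": sum(b["size"] for b in bucket_attrs),
        "nlink": sum(b["nlink"] for b in bucket_attrs),
        "mtime": max(b["mtime"] for b in bucket_attrs),
    }

attrs = clustered_dir_attrs([
    {"size": 4096, "nlink": 3, "mtime": 1000},
    {"size": 8192, "nlink": 5, "mtime": 1200},
])
print(attrs)  # {'size': 12288, 'nlink': 8, 'mtime': 1200}
```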
<br />
=== Clustered MDS protocol.=== <br />
The clustered MDS protocol involves a few changes to the API implementation described above. Most of the changes involve some new API calls between MDS servers. The goal is to use a single recovery infrastructure among the MDS servers and the clients, as described earlier in this chapter. Some detailed work remains to be done in the design to avoid cyclic lock dependencies or acknowledgment graphs (refer to section 11.3.6). As described previously in section 11.3.6, we now enforce ACKs for replies. The MDS takes locks on the resources it modifies; these locks are canceled once ACKs are received. In the clustered MDS scenario, it is important to ensure that a deadlock is not caused by the various systems waiting for ACKs from each other. <br />
<blockquote><br />
;'''mds_create:''': This call needs modifications when creating a new directory, because the new directory inode and new directory data will be created on another MDS server than the parent. The node holding the parent directory data will do a lookup, find that the name is negative, and hold a lock. It will then make an MDC RPC to create a remote inode; when that call returns, the directory data can be filled in. The key issue here is recovery of the remote inode creation, which either requires writing the fid of the created inode in the commit log or using preallocated inodes. It is easy to see that in the normal case of file creations the code path is equally efficient for a clustered metadata service and a single-node one. <br />
<br />
;'''mds_rename/mds_link:''': These calls are probably the most interesting of all. They involve three nodes: the source and target nodes holding the directory data, and the node holding the inode whose link is to be renamed. An important invariant is that bucket inodes and directory inodes are always on the same node as the associated directory data. This call pattern involves the MDS making a remote link RPC to another MDS and a remote setattr RPC to the MDS holding the inode to be renamed. The calls appear to be easily recovered in case of failures. <br />
<br />
;'''mds_unlink:''': This is also a two-stage call. Both for creation and unlinking, the management of orphans is important. This orphan management is entirely analogous between the MDS and OST data objects. ''Orphaned objects'' can arise during object creation or removal: objects might be created on the OSTs, but the MDS could fail before recording them in the extended attributes on persistent store. Similarly, during deletion, it is possible that the record of the objects is deleted on the MDS but the corresponding objects are not deleted on the OSTs before a failure occurs. The first situation can only be prevented by requiring the OSTs to log every object creation; the MDS sends an asynchronous message to the OSTs once the object information has been stored persistently, and the OSTs can then delete the corresponding logs. Similarly, in the second case, the MDS can keep logs of object deletion; if an OST fails before removing the corresponding objects, it can check with the MDS upon recovery and delete the required objects. <br />
</blockquote><br />
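The creation-side orphan protocol described above can be sketched as follows (hypothetical Python; function and variable names are invented, and real Lustre keeps these logs in persistent llog objects):<br />

```python
# Sketch: the OST logs each object creation, and cancels the log record
# once the MDS confirms the object is recorded in the file's extended
# attributes. Records still present at recovery mark potential orphans.
creation_log = {}                 # object id -> log record (on the OST)

def ost_create(objid):
    creation_log[objid] = {"objid": objid}   # log before the object is usable
    return objid

def mds_recorded(objid):
    creation_log.pop(objid, None)            # async cancel after MDS commit

def ost_recovery_orphans():
    """Objects still logged at recovery time; check with the MDS, then destroy."""
    return sorted(creation_log)

ost_create(1); ost_create(2)
mds_recorded(1)                   # MDS committed object 1's EA, object 2's was lost
print(ost_recovery_orphans())     # [2]
```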
<br />
=== Clustered MDS recovery.===<br />
==== ''Client-MDS replay protocol.'' ==== <br />
The clustered MDS-client recovery protocol is very similar to the single MDS-client protocol. In this case too, the MDS servers need to track whether a client request was executed, replied to, or committed. An MDS also regards other MDS systems that make requests as part of clustered metadata updates as clients for recovery purposes. If a request is committed, replay is not required; the metadata server can simply forget the state associated with that request, except that it must remain capable of reproducing the reply until the client has acknowledged it. A request that was not executed can simply be retransmitted by the client upon recovery; Lustre uses the word ''resending'' for this part of recovery. For requests that were executed and saw replies, but whose effects were lost on persistent storage, the retransmission mechanism is called ''replay.'' <br />
==== Replay.==== <br />
To order transaction sequences Lustre uses reply ACKs: the ACK serves only one purpose, to release a lock that enforces the ordering of the transaction sequence. In the case where MDS operations involve more than one server, the reply "ack" from the primary to the secondary servers should only be sent after the client has sent its ACK to the primary server. This MDS-MDS reply ack is then not really an ack anymore but simply a lock cancelation request. Clients will replay lost transactions to the MDS which they originally engaged for the request. Orphaned children will be cleaned up only after replay completes, to allow orphaned objects to be re-used during replay.<br />
<br />
==== Failures of multiple MDS nodes.==== <br />
The handling of recovery of orphan objects between clustered metadata servers is identical to that of the single MDS case. <br />
<br />
A new problem arises from multiple metadata server failures, such as those caused by a power failure. In this case the MDSs must be rolled back to a consistent state. <br />
<br />
'''Example:''' In transaction 1, a node X creates directory a. Then in transaction 2 a cross-MDS-node rename moves a file with a directory entry on node Y into this directory. It is now possible for this file to lose its directory entry on Y while the transaction on X does not commit. More complex examples exist. <br />
<br />
We do this with a standard algorithm known as a consistent cut in causal time, or snapshot (see Birman [] or other books on distributed algorithms). A consistent snapshot is a state of the MDSs that could have been reached through full execution of requests coming from clients; in other words, a consistent snapshot is a state of the MDS file systems that represents a valid file system. After multiple simultaneous MDS failures the state of the MDSs must be rolled back to a consistent snapshot. We say that a transaction on MDS1 depends on a transaction on MDS2 when the completion of a request to MDS1 has the transaction on MDS2 as a component. <br />
<br />
Each MDS retains logs of transactions, sufficiently detailed that the transactions can be undone. Each log record contains a transaction number corresponding to the transaction on this node and the transaction numbers of transactions that were started on other MDSs to complete this transaction. The log records can be used for two operations: they can be canceled when the MDS cluster as a whole has committed the transactions that relate to a particular log record, and they can be used to undo operations that were already performed. <br />
<br />
Every few seconds, the cluster computes a snapshot by first electing a leader. The leader asks all MDSs for their last committed transaction numbers. The MDSs respond, also providing the transaction numbers of the transactions on other MDSs that they depend on. If an MDS provided a dependency higher than what was committed, that MDS is asked to resend its transactions and dependencies to account for this. The algorithm then repeats; it converges because it produces a strictly decreasing set of transaction numbers. When the transaction numbers have reached a consistent snapshot, all MDSs are told what their last committed transaction for the snapshot is. Clients can be told to discard all requests held for replay that are older than those found in the snapshot. <br />
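The convergence of the cut computation can be illustrated with a small sketch (hypothetical Python with invented transaction numbers; it simplifies the real protocol by lowering each cut one transaction at a time until no included transaction depends on an excluded one):<br />

```python
# Illustrative consistent-cut computation. Each MDS reports its last
# committed transaction; each transaction may depend on transactions on
# other MDSs. The leader lowers the cut until it is consistent.

# deps[mds][txno] = list of (other_mds, other_txno) this transaction needs.
deps = {
    "mds1": {5: [("mds2", 9)]},          # mds1 tx 5 needs mds2 tx 9
    "mds2": {},
}
last_committed = {"mds1": 5, "mds2": 8}  # mds2 only committed up to tx 8

def consistent_cut(last_committed, deps):
    cut = dict(last_committed)
    changed = True
    while changed:                        # strictly decreasing -> terminates
        changed = False
        for mds, txno in list(cut.items()):
            for dep_mds, dep_tx in deps.get(mds, {}).get(txno, []):
                if dep_tx > cut[dep_mds]:
                    cut[mds] = txno - 1   # drop the dependent transaction
                    changed = True
    return cut

print(consistent_cut(last_committed, deps))  # {'mds1': 4, 'mds2': 8}
```

Here mds1's transaction 5 must be excluded (undone from the logs) because its component on mds2, transaction 9, was never committed.<br />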
<br />
The coordinating MDS of a client-initiated transaction will first establish that the transaction can commit on all nodes, by acquiring locks on directories and checking for available space, existing entries with the same name, etc. It may also first perform a directory split if the directory is becoming too large and more MDS nodes are still available. <br />
<br />
All nodes involved in the transaction need a transaction sequence number to place the transaction into their sequence and allow correct replay. At this point the coordinator will: <br />
<blockquote><br />
* Start a transaction locally. <br />
* Report the transaction sequence number to all other nodes involved in the transaction. <br />
* These nodes commit (in memory, as usual), write a journal record for replay, and reply to the coordinator. <br />
* The coordinator then commits its own transaction. <br />
* The MDSs create metadata undo log records, which are subject to normal log commit cancelation messages; on the coordinator, however, commit messages must be received from the leader before the record is canceled. <br />
</blockquote><br />
<br />
==== Failover rings.==== <br />
The configuration data can designate a standby MDS that will take over from a failed MDS. By organizing the servers in one or more rings, the nearest working left neighbor MDS can be the failover node. This leads to a simple scheme with multiple failover nodes, avoiding quorum and other complications beyond what is needed for two node clusters.<br />
<br />
== References ==<br />
<br />
[[Category:Architecture|Clustered Metadata]]</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Architecture_-_MPI_LND&diff=9739Architecture - MPI LND2007-08-17T10:38:32Z<p>Lydia: </p>
<hr />
<div>== References ==<br />
<br />
[http://arch.lustre.org/images/b/bb/Lnet_mpi_white_paper_v1.pdf MPI_LND_PDF_paper]<br />
<br />
[[Category:Architecture|MPI_LND]]</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Simul_Parallel_File_System_Test_Tool&diff=3191Simul Parallel File System Test Tool2007-06-12T10:02:06Z<p>Lydia: /* simul */</p>
<hr />
<div>== simul ==<br />
<br />
simul (available in the usual spot on the ftp site, or in ~morrone on MCR) tests most filesystem syscalls from one or more threads on one or more nodes. Building simul is straightforward.<br />
<br />
It is easily run with either pdsh or prun. On MCR, this amounts to:<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
pdsh -E -w <node list> [-n tasks-per-node] /path/simul -d /mnt/lustre/path/ <br />
|}<br />
Using prun is left as an exercise to the reader.<br />
<br />
----<br />
* '''[http://wiki.lustre.org/index.php?title=Front_Page FrontPage]'''</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Patchless_Client&diff=2974Patchless Client2007-06-08T08:22:48Z<p>Lydia: /* Versions */</p>
<hr />
<div>== Patchless Client ==<br />
As of Lustre 1.6.0, Lustre supports running the client modules on some unpatched "stock" kernels.<br />
This results in some small performance losses, but may be worthwhile to some users for maintenance or contract reasons. <br />
<br />
We will typically post a "patchless" RPM at the [http://downloads.clusterfs.com/customer download site]. Alternatively, if building from source, the Lustre configure script will automatically detect the unpatched kernel and disable building the servers.<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
<pre><br />
[lustre]$ ./configure --with-linux=/unpatched/kernel/source <br />
</pre><br />
|}<br />
=== Versions ===<br />
Currently, the patchless client works with these kernel versions:<br />
<br />
Vanilla kernel:<br />
* 2.6.15 (1.6.0)<br />
* 2.6.16 (1.6.0)<br />
* 2.6.17 (1.6.0) Mandriva's 2.6.17 is also reported working.<br />
* 2.6.18 (1.6.0) Debian 4.0 2.6.18 is also reported working<br />
* 2.6.19 (1.6.0)<br />
* 2.6.20 (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])<br />
* 2.6.21 (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])<br />
<br />
Red Hat Enterprise Linux:<br />
* RHEL4 [2.6.9-42.0.8EL] (1.6.0) with the following caveats:<br />
** Nested symlinks: due to improper lookup_continue logic in unpatched 2.6.15 and earlier kernels, nested symlinks will lead to unpredictable results.<br />
** FMODE_EXEC missing: Lustre will incorrectly allow a user on one client to write/truncate a binary while a user on a different client simultaneously executes the same binary.<br />
* RHEL4U5 [2.6.9-55EL] (1.6.0) Red Hat has included a Lustre-specific patch with RHEL4U5 which resolves the above issues.<br />
<br />
* RHEL5 (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])<br />
<br />
Fedora Core:<br />
* FC6 (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])<br />
<br />
Suse:<br />
* SLES 10 (tbd)<br />
<br />
=== Known Issues ===<br />
<br />
Many NFS-related bugs are also addressed by the patchless client fixes.</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Lustre_DDN_Tuning&diff=2850Lustre DDN Tuning2007-06-05T09:42:59Z<p>Lydia: /* Further Tuning tips */</p>
<hr />
<div>== Introduction ==<br />
Guide to configuring DDN storage arrays for use with Lustre. For a complete DDN tuning manual, see Section 3.3 Performance Management of the [http://www.ddnsupport.com/manuals.html DDN manual] for your product.<br />
<br />
== Settings ==<br />
=== MF, readahead ===<br />
For the DDN 8500, CFS recommends disabling readahead. If you consider a 1000-client system, each client with up to 8 read RPCs in flight, this means 8 * 1000 * 1MB = 8GB of reads in flight. With a DDN cache in the range of 2-5GB (depending on model), it is unlikely that the LUN-based readahead would get ANY cache hits, even if the file data were contiguous on disk (which it often isn't). The Multiplication Factor (MF) also influences the readahead and should be disabled.<br />
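The arithmetic behind this recommendation, spelled out (using binary megabytes, so the total is just under 8GB):<br />

```python
# Reads potentially in flight from the clients, as estimated above.
clients = 1000
read_rpcs_in_flight = 8   # per client
rpc_size_mb = 1           # 1MB read RPCs

in_flight_mb = clients * read_rpcs_in_flight * rpc_size_mb
in_flight_gb = in_flight_mb / 1024.0
print(in_flight_mb, "MB =", in_flight_gb, "GB")  # 8000 MB = 7.8125 GB

# Far larger than the 2-5GB DDN cache, so LUN readahead rarely hits.
ddn_cache_gb = 5
print(in_flight_gb > ddn_cache_gb)  # True
```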
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache prefetch=0 <br />
cache MF=off<br />
|}<br />
Evidence can be found in the 8500 sgpdd survey: attachment:ddn.xls<br />
<br />
For a DDN S2A 9500 or 9550, CFS also recommends disabling readahead using the commands above.<br />
<br />
=== segment size ===<br />
The cache segment size affects IO performance noticeably. It should be set differently on the MDT (which does small, random IOs) and on an OST (which does large, contiguous IOs). The optimal values found in customer testing are 64kB for the MDT and 1MB for the OST. Unfortunately, the ''cache size'' parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache size=64 # for MDT LUN. size is in kB, 64, 128, 256, 512, 1024, and 2048. Default 128<br />
cache size=1024 # for OST LUN<br />
|}<br />
The effects of cache segment size have not been extensively studied on the S2A 9500 or 9550.<br />
<br />
=== Write-back cache ===<br />
Some customers run with the write-back cache turned ON because it can improve performance noticeably. They accept the risk that after a DDN controller crash they will need to run e2fsck, reasoning that this will cost less time than the performance hit of running with the write-back cache turned off.<br />
<br />
Other customers run with the write-back cache OFF, for increased data security and in failover configurations. However, some of these customers experience performance problems with the small writes during journal flush. In this mode it is highly beneficial to also increase the number of OST service threads ("options ost ost_num_threads=512" in /etc/modprobe.conf), if the OST has enough RAM (about 1.5MB per thread is preallocated for IO buffers). Having more IO threads allows more IO requests to be in flight, each waiting for the disk to complete its synchronous write.<br />
<br />
This is a decision each customer needs to make for themselves: is performance more important than the slight risk of data loss and downtime if there is a hardware/software problem on the DDN? Note that there is no risk from an OSS/MDS node crashing; the risk exists only if the DDN itself fails.<br />
<br />
=== Further Tuning tips ===<br />
Experiences drawn from testing at a large installation:<br />
<br />
* Separate the EXT3 OST into two LUNs: one small LUN for the EXT3 journal, and one big LUN for the "data".<br />
* Since Lustre 1.0.4, EXT3 mkfs options such as -j and -J can be supplied when creating the OST, as in the following example (where /dev/sdj was formatted beforehand as a journal device). The journal size should not be larger than about 1GB (262144 4kB blocks), as it can consume up to this amount of RAM on the OSS node per OST.<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
<pre><br />
# mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]<br />
<br />
# in the LMC {config}.sh script:<br />
${LMC} --add mds --node io1 --mds iap-mds --dev /dev/sdi --mkfsoptions "-j -J device=/dev/sdj" --failover --group iap-mds<br />
</pre><br />
|}<br />
* Very important: on the S2A 8500, we have proven that you need to create one OST per tier, especially in write-through mode (see the illustration below). This matters if you have 16 tiers: create 16 OSTs of one tier each instead of 8 OSTs made of 2 tiers each.<br />
* On the S2A 9500 and 9550, we measured significantly better performance with 2 tiers per LUN.<br />
* Do '''NOT''' partition the DDN LUNs, as this causes ALL IO to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cachelines are aligned on 1MB boundaries, and having the partition table on the LUN causes ALL 1MB writes to do a read-modify-write on an extra chunk, and ALL 1MB reads to instead read 2MB from disk into the cache. This has been shown to cause a noticable performance loss.<br />
* You are not obliged to lock the small LUNs in cache.<br />
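The misalignment effect from the partitioning point above can be illustrated numerically (illustrative Python; a partition table shifts all data by 512 bytes relative to the DDN's 1MB stripe/cacheline boundaries):<br />

```python
# Sketch of why a 512-byte-shifted partition hurts: every 1MB file write
# that would have been stripe-aligned now straddles two 1MB RAID stripes,
# forcing a read-modify-write (or a 2MB read) on the extra stripe.
MB = 1 << 20

def stripes_touched(offset, length, stripe=MB):
    """Number of 1MB stripes an IO at byte offset `offset` touches."""
    first = offset // stripe
    last = (offset + length - 1) // stripe
    return last - first + 1

print(stripes_touched(0, MB))    # 1 stripe: aligned 1MB write on a raw LUN
print(stripes_touched(512, MB))  # 2 stripes: shifted by the partition table
```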
<br />
=== maxcmds ===<br />
'''S2A 8500:'''<br />
<br />
One customer experienced a 30% improvement in write performance by changing this value from the default of 2 to 4. This works only with SATA-based disks and ''only'' if you can guarantee that just one controller of the pair will actually be accessing the shared LUNs.<br />
<br />
This information comes with a warning, as DDN support does not recommend changing this setting from the default. When the value was increased to 5, the same customer experienced serious problems.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
disk maxcmds=3 # default is 2 <br />
<br />
|}<br />
'''S2A 9500/9550:'''<br />
For this hardware, a value of 16 is recommended. The default value is 6. The maximum value is 32, but values above 16 are not currently recommended by DDN support.<br />
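Assuming the 9500/9550 uses the same CLI syntax as the 8500 example above, the recommended setting would presumably be applied as:<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
disk maxcmds=16 # default is 6; values above 16 not currently recommended<br />
|}<br />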
<br />
=== Illustration - one OST per Tier ===<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
<pre><br />
Capacity Block<br />
LUN Label Owner Status (Mbytes) Size Tiers Tier list<br />
------------------------------------------------------------------<br />
0 1 Ready 512 1 1<br />
1 1 Ready 512 1 2<br />
2 1 Ready 512 1 3<br />
3 1 Ready 512 1 4<br />
4 2 Ready [GHS] 1 5<br />
5 2 Ready [GHS] 1 6<br />
6 2 Critical 512 1 7<br />
7 2 Critical 1 8<br />
10 1 Cache Locked 64 512 1 1<br />
11 1 Cache Locked 64 512 1 2<br />
12 1 Cache Locked 64 512 1 3<br />
13 1 Cache Locked 64 512 1 4<br />
14 2 Ready [GHS] 64 512 1 5<br />
15 2 Ready [GHS] 64 512 1 6<br />
16 2 Critical 64 512 1 7<br />
17 2 Critical 64 512 1 8<br />
System verify extent: 16 Mbytes<br />
System verify delay: 30<br />
<br />
</pre><br />
|}</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Lustre_DDN_Tuning&diff=2845Lustre DDN Tuning2007-06-05T09:32:45Z<p>Lydia: /* Illustration - one OST per Tier */</p>
<hr />
<div>== Introduction ==<br />
Guide to configuring DDN storage arrays for use with Lustre. For a complete DDN tuning manual, see Section 3.3 Performance Management of the [http://www.ddnsupport.com/manuals.html DDN manual] for your product.<br />
<br />
== Settings ==<br />
=== MF, readahead ===<br />
For the DDN 8500, CFS recommends disabling readahead. Consider a 1000-client system, each client with up to 8 read RPCs of 1MB in flight: that is 8 * 1000 * 1MB = 8GB of reads in flight. With a DDN cache in the range of 2-5GB (depending on model), it is unlikely that the LUN-based readahead would get ANY cache hits, even if the file data were contiguous on disk (which it often isn't). The Multiplication Factor (MF) also influences readahead and should be disabled.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache prefetch=0<br />
cache MF=off<br />
|}<br />
Evidence can be found in the 8500 sgpdd survey: attachment:ddn.xls<br />
<br />
For a DDN S2A 9500 or 9550, CFS also recommends disabling readahead using the commands above.<br />
<br />
=== segment size ===<br />
The cache segment size noticeably affects IO performance. It should be set differently on the MDT (which does small, random IOs) and an OST (which does large, contiguous IOs). The optimum values found in customer testing are 64kB for the MDT and 1MB for the OST. Unfortunately, the ''cache size'' parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache size=64 # for MDT LUN. size is in kB, 64, 128, 256, 512, 1024, and 2048. Default 128<br />
cache size=1024 # for OST LUN<br />
|}<br />
The effects of cache segment size have not been extensively studied on the S2A 9500 or 9550.<br />
<br />
=== Write-back cache ===<br />
Some customers run with the write-back cache turned ON because it improves performance noticeably. They accept the risk that a DDN controller crash will require running e2fsck, judging that this costs less time overall than the performance loss of running with the write-back cache turned off.<br />
<br />
Other customers run with the write-back cache OFF, for increased data security and in failover configurations. However, some of these customers experience performance problems with the small writes during journal flush. In this mode it is highly beneficial to also increase the number of OST service threads "options ost ost_num_threads=512" in /etc/modprobe.conf, if the OST has enough RAM (about 1.5MB/thread is preallocated for IO buffers). Having more IO threads allows more IO requests to be in flight and waiting for the disk to complete the synchronous write.<br />
<br />
This is a decision each site must make for itself: is performance more important than the slight risk of data loss and downtime if there is a hardware or software problem on the DDN? Note that an OSS/MDS node crash poses no risk; data is only at risk if the DDN itself fails.<br />
<br />
=== Further Tuning tips ===<br />
Experiences drawn from testing at a large installation:<br />
<br />
* Separate each EXT3 OST into two LUNs: one small LUN for the EXT3 journal and one large LUN for the data.<br />
* Since Lustre 1.0.4, EXT3 mkfs options such as -j and -J can be supplied when the OST is created, as in the following example (where /dev/sdj has previously been formatted as a journal device). The journal size should not be larger than about 1GB (262144 4kB blocks), as it can consume up to this amount of RAM on the OSS node per OST.<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
# mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]<br />
<br />
in the LMC {config}.sh script:<br />
${LMC} --add mds --node io1 --mds iap-mds --dev /dev/sdi --mkfsoptions "-j -J device=/dev/sdj" --failover --group iap-mds<br />
|}<br />
* Very important: on the S2A 8500, testing has shown that one OST must be created per TIER, especially in write-through mode (see the illustration below). For example, with 16 tiers, create 16 OSTs of one tier each instead of 8 OSTs of two tiers each.<br />
* On the S2A 9500 and 9550, we measured significantly better performance with 2 tiers per LUN.<br />
* Do '''NOT''' partition the DDN LUNs, as this causes ALL IO to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cachelines are aligned on 1MB boundaries; with a partition table on the LUN, every 1MB write does a read-modify-write on an extra chunk, and every 1MB read fetches 2MB from disk into the cache. This has been shown to cause a noticeable performance loss.<br />
* It is not necessary to lock the small (journal) LUNs in cache.<br />
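The two-LUN journal layout described above can be sketched end to end. The device names (/dev/sdj for the journal LUN, /dev/sdi for the data LUN) and the 400MB journal size are illustrative assumptions, not fixed values:<br />

```shell
# Format the small LUN as an external EXT3 journal
# (102400 4kB blocks = 400MB; stay under the ~1GB upper bound noted above).
mke2fs -O journal_dev -b 4096 /dev/sdj 102400

# Format the data LUN with its journal on the external device.
mke2fs -j -J device=/dev/sdj -b 4096 /dev/sdi

# Upper bound on journal size from the text (262144 4kB blocks), in MB:
echo "max journal: $((262144 * 4 / 1024)) MB"
```

The arithmetic confirms the stated limit: 262144 blocks of 4kB is 1024 MB, and roughly that much OSS RAM can be consumed per OST.<br />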
<br />
=== maxcmds ===<br />
'''S2A 8500:'''<br />
<br />
One customer experienced a 30% improvement in write performance by changing this value from the default of 2 to 4. This works only with SATA-based disks, and ''only'' if you can guarantee that just one controller of the pair will actually access the shared LUNs.<br />
<br />
This information comes with a warning: DDN support does not recommend changing this setting from the default. After increasing the value to 5, the same customer experienced serious problems.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
disk maxcmds=3 # default is 2<br />
<br />
|}<br />
'''S2A 9500/9550:'''<br />
For this hardware, a value of 16 is recommended. The default value is 6. The maximum value is 32, but values above 16 are not currently recommended by DDN support.<br />
<br />
=== Illustration - one OST per Tier ===<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
Capacity Block<br />
LUN Label Owner Status (Mbytes) Size Tiers Tier list<br />
------------------------------------------------------------------<br />
0 1 Ready 512 1 1<br />
1 1 Ready 512 1 2<br />
2 1 Ready 512 1 3<br />
3 1 Ready 512 1 4<br />
4 2 Ready [GHS] 1 5<br />
5 2 Ready [GHS] 1 6<br />
6 2 Critical 512 1 7<br />
7 2 Critical 1 8<br />
10 1 Cache Locked 64 512 1 1<br />
11 1 Cache Locked 64 512 1 2<br />
12 1 Cache Locked 64 512 1 3<br />
13 1 Cache Locked 64 512 1 4<br />
14 2 Ready [GHS] 64 512 1 5<br />
15 2 Ready [GHS] 64 512 1 6<br />
16 2 Critical 64 512 1 7<br />
17 2 Critical 64 512 1 8<br />
System verify extent: 16 Mbytes<br />
System verify delay: 30<br />
<br />
|}</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Lustre_DDN_Tuning&diff=2840Lustre DDN Tuning2007-06-05T09:27:20Z<p>Lydia: /* Further Tuning tips */</p>
<hr />
<div>== Introduction ==<br />
Guide to configuring DDN storage arrays for use with Lustre. For a complete DDN tuning manual, see Section 3.3 Performance Management of the [http://www.ddnsupport.com/manuals.html DDN manual] for your product.<br />
<br />
== Settings ==<br />
=== MF, readahead ===<br />
For the DDN 8500, CFS recommends disabling readahead. Consider a 1000-client system, each client with up to 8 read RPCs of 1MB in flight: that is 8 * 1000 * 1MB = 8GB of reads in flight. With a DDN cache in the range of 2-5GB (depending on model), it is unlikely that the LUN-based readahead would get ANY cache hits, even if the file data were contiguous on disk (which it often isn't). The Multiplication Factor (MF) also influences readahead and should be disabled.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache prefetch=0<br />
cache MF=off<br />
|}<br />
Evidence can be found in the 8500 sgpdd survey: attachment:ddn.xls<br />
<br />
For a DDN S2A 9500 or 9550, CFS also recommends disabling readahead using the commands above.<br />
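The in-flight arithmetic above is easy to reproduce as a throwaway check; the client count, RPCs in flight, and RPC size are the example's assumptions, not measured values:<br />

```shell
# Worst-case data in flight from client read RPCs, per the example above.
CLIENTS=1000
RPCS_IN_FLIGHT=8
RPC_MB=1
echo "potential in-flight reads: $((CLIENTS * RPCS_IN_FLIGHT * RPC_MB)) MB (~8GB)"
echo "typical DDN cache: 2-5 GB, so LUN-based readahead rarely hits"
```

With 8000 MB potentially outstanding against at most 5 GB of controller cache, readahead buys nothing and wastes cache, hence the recommendation to disable it.<br />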
<br />
=== segment size ===<br />
The cache segment size noticeably affects IO performance. It should be set differently on the MDT (which does small, random IOs) and on an OST (which does large, contiguous IOs). The optimal values found in customer testing are 64kB for the MDT and 1MB for the OST. Unfortunately, the ''cache size'' parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache size=64 # for MDT LUN; size in kB; valid values: 64, 128, 256, 512, 1024, 2048; default 128<br />
cache size=1024 # for OST LUN<br />
|}<br />
The effects of cache segment size have not been extensively studied on the S2A 9500 or 9550.<br />
<br />
=== Write-back cache ===<br />
Some customers run with the write-back cache turned ON because it can improve performance noticeably. They accept the risk that, after a DDN controller crash, they will need to run e2fsck; they judge that the recovery time costs them less than the ongoing performance hit of running with the write-back cache turned off.<br />
<br />
Other customers run with the write-back cache OFF, for increased data security and in failover configurations. However, some of these customers experience performance problems with the small synchronous writes issued during journal flush. In this mode it is highly beneficial to also increase the number of OST service threads ("options ost ost_num_threads=512" in /etc/modprobe.conf), provided the OST has enough RAM (about 1.5MB/thread is preallocated for IO buffers). Having more IO threads allows more IO requests to be in flight, each waiting for the disk to complete its synchronous write.<br />
<br />
This is a decision each customer must make: is the performance gain worth the slight risk of data loss and downtime if there is a hardware or software problem on the DDN? Note that an OSS/MDS node crash poses no risk; data is only at risk if the DDN itself fails.<br />
<br />
=== Further Tuning tips ===<br />
Experiences drawn from testing at a large installation:<br />
<br />
* Separate each EXT3 OST into two LUNs: one small LUN for the EXT3 journal and one large LUN for the data.<br />
* Since Lustre 1.0.4, EXT3 mkfs options such as -j and -J can be supplied when the OST is created, as in the following example (where /dev/sdj has first been formatted as an external journal). The journal size should not be larger than about 1GB (262144 4kB blocks), as the journal can consume up to this amount of RAM on the OSS node per OST.<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
# mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]<br />
in LMC {config}.sh script:<br />
${LMC} --add mds --node io1 --mds iap-mds --dev /dev/sdi --mkfsoptions "-j -J device=/dev/sdj" --failover --group iap-mds<br />
|}<br />
* Very important: on the S2A 8500, testing has proven that one OST should be created per TIER, especially in write-through mode (see the illustration below). This matters if you have 16 tiers: create 16 OSTs of one tier each instead of 8 OSTs of two tiers each.<br />
* On the S2A 9500 and 9550, we measured significantly better performance with 2 tiers per LUN.<br />
* Do '''NOT''' partition the DDN LUNs, as this causes ALL IO to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cachelines are aligned on 1MB boundaries; having a partition table on the LUN causes every 1MB write to do a read-modify-write on an extra chunk, and every 1MB read to instead read 2MB from disk into the cache. This has been shown to cause a noticeable performance loss.<br />
* locking the small (journal) LUNs in cache is not required<br />
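The partition-misalignment point above can be illustrated with quick shell arithmetic (a sketch: the 512-byte offset is the classic DOS partition-table offset, and the 1MB stripe size is the DDN alignment stated above):<br />

```shell
# Show that a 512B partition offset makes every 1MB IO straddle two
# 1MB-aligned RAID stripes, turning a full-stripe write into a
# read-modify-write and a 1MB read into a 2MB cache fill.
stripe=$((1024 * 1024))   # DDN stripe/cacheline size: 1MB
offset=512                # misalignment introduced by a partition table
for i in 0 1 2; do
  start=$((i * stripe + offset))
  end=$((start + stripe - 1))
  echo "IO $i: stripes $((start / stripe))..$((end / stripe))"
done
# With offset=0, each IO would touch exactly one stripe.
```
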
<br />
=== maxcmds ===<br />
'''S2A 8500:'''<br />
<br />
One customer experienced a 30% improvement in write performance by changing this value from the default of 2 to 4. This works only with SATA-based disks, and ''only'' if you can guarantee that just one controller of the pair will actually be accessing the shared LUNs.<br />
<br />
This information comes with a warning: DDN support does not recommend changing this setting from the default. When the same customer increased the value to 5, they experienced serious problems.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{{{<br />
disk maxcmds=3 # default is 2<br />
}}}<br />
'''S2A 9500/9550:'''<br />
For this hardware, a value of 16 is recommended. The default value is 6. The maximum value is 32, but values above 16 are not currently recommended by DDN support.<br />
=== Illustration - one OST per Tier ===<br />
{{{<br />
Capacity Block<br />
LUN Label Owner Status (Mbytes) Size Tiers Tier list<br />
------------------------------------------------------------------<br />
0 1 Ready 512 1 1<br />
1 1 Ready 512 1 2<br />
2 1 Ready 512 1 3<br />
3 1 Ready 512 1 4<br />
4 2 Ready [GHS] 1 5<br />
5 2 Ready [GHS] 1 6<br />
6 2 Critical 512 1 7<br />
7 2 Critical 1 8<br />
10 1 Cache Locked 64 512 1 1<br />
11 1 Cache Locked 64 512 1 2<br />
12 1 Cache Locked 64 512 1 3<br />
13 1 Cache Locked 64 512 1 4<br />
14 2 Ready [GHS] 64 512 1 5<br />
15 2 Ready [GHS] 64 512 1 6<br />
16 2 Critical 64 512 1 7<br />
17 2 Critical 64 512 1 8<br />
System verify extent: 16 Mbytes<br />
System verify delay: 30<br />
<br />
}}}</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Lustre_DDN_Tuning&diff=2835Lustre DDN Tuning2007-06-05T09:20:03Z<p>Lydia: /* MF, readahead */</p>
<hr />
<div>== Introduction ==<br />
A guide to configuring DDN storage arrays for use with Lustre. For a complete DDN tuning manual, see Section 3.3 (Performance Management) of the [http://www.ddnsupport.com/manuals.html DDN manual] for your product.<br />
<br />
== Settings ==<br />
=== MF, readahead ===<br />
For the DDN 8500, CFS recommends disabling readahead. Consider a 1000-client system, each client with up to 8 read RPCs in flight: this means 8 * 1000 * 1MB = 8GB of reads in flight. With a DDN cache in the range of 2-5GB (depending on model), it is unlikely that the LUN-based readahead would get ANY cache hits, even if the file data were contiguous on disk (which it often isn't). The Multiplication Factor (MF) also influences readahead and should be disabled.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache prefetch=0<br />
cache MF=off<br />
|}<br />
Evidence can be found in the 8500 sgpdd survey: attachment:ddn.xls<br />
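The sizing argument above can be reproduced with shell arithmetic (a sketch using only the figures quoted in this section):<br />

```shell
# In-flight read volume vs. controller cache size:
# 1000 clients, up to 8 read RPCs in flight each, 1MB per RPC.
clients=1000
rpcs=8
rpc_mb=1
inflight_mb=$((clients * rpcs * rpc_mb))
echo "reads in flight: ${inflight_mb}MB (~8GB, vs a 2-5GB controller cache)"
```
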
<br />
For a DDN S2A 9500 or 9550, CFS also recommends disabling readahead using the commands above.<br />
<br />
=== segment size ===<br />
The cache segment size noticeably affects IO performance. It should be set differently on the MDT (which does small, random IOs) and on an OST (which does large, contiguous IOs). The optimal values found in customer testing are 64kB for the MDT and 1MB for the OST. Unfortunately, the ''cache size'' parameter is common to all LUNs on a single DDN and cannot be changed on a per-LUN basis.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache size=64 # for the MDT LUN; valid values (in kB): 64, 128, 256, 512, 1024, 2048; default 128<br />
cache size=1024 # for OST LUN<br />
|}<br />
The effects of cache segment size have not been extensively studied on the S2A 9500 or 9550.<br />
=== Write-back cache ===<br />
Some customers run with the write-back cache turned ON, because it can improve performance noticeably. They accept the risk that, after a DDN controller crash, they will need to run e2fsck; the recovery time costs them less than the ongoing performance penalty of running with the write-back cache turned off.<br />
<br />
Other customers run with the write-back cache OFF, for increased data security and in failover configurations. However, some of these customers see performance problems with the small writes issued during journal flush. In this mode it is highly beneficial to also increase the number of OST service threads, using "options ost ost_num_threads=512" in /etc/modprobe.conf, provided the OST has enough RAM (about 1.5MB per thread is preallocated for IO buffers). More IO threads allow more IO requests to be in flight, each waiting for the disk to complete its synchronous write.<br />
<br />
This is a decision each customer needs to make: is performance more important than the slight risk of data loss and downtime if there is a hardware or software problem on the DDN? Note that there is no risk from an OSS/MDS node crash; the risk arises only if the DDN itself fails.<br />
<br />
=== Further Tuning tips ===<br />
Experiences drawn from testing at a large installation:<br />
<br />
* separate each EXT3 OST into two LUNs: one small LUN for the EXT3 journal, and one large LUN for the data<br />
* Since Lustre 1.0.4, EXT3 mkfs options such as -j and -J can be supplied when the OST is created, as in the following example (where /dev/sdj has previously been formatted as a journal device). The journal size should not be larger than about 1GB (262144 4kB blocks), as it can consume up to this amount of RAM on the OSS node per OST.<br />
{{{<br />
# mke2fs -O journal_dev -b 4096 /dev/sdj [optional size]<br />
in LMC {config}.sh script:<br />
${LMC} --add mds --node io1 --mds iap-mds --dev /dev/sdi --mkfsoptions "-j -J device=/dev/sdj" --failover --group iap-mds<br />
}}}<br />
* Very important: on the S2A 8500, we have found that one OST must be created per TIER, especially in Write-Through mode (see the illustration below). This matters if you have 16 Tiers: create 16 OSTs of one Tier each instead of 8 OSTs of 2 Tiers each.<br />
* On the S2A 9500 and 9550, we measured significantly better performance with 2 tiers per LUN.<br />
* Do '''NOT''' partition the DDN LUNs, as this causes ALL IO to the LUNs to be misaligned by 512 bytes. The DDN RAID stripes and cachelines are aligned on 1MB boundaries, and having the partition table on the LUN causes ALL 1MB writes to do a read-modify-write on an extra chunk, and ALL 1MB reads to instead read 2MB from disk into the cache. This has been shown to cause a noticeable performance loss.<br />
* Locking the small (journal) LUNs in cache is optional.<br />
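As a quick sanity check of the 1GB journal ceiling mentioned above (a hypothetical calculation, not a DDN or Lustre command):

```python
# 262144 blocks of 4kB is exactly 1GB, matching the journal-size guidance above.
journal_blocks = 262144
block_kb = 4
journal_gb = journal_blocks * block_kb / (1024 * 1024)
print(journal_gb)  # → 1.0
```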
=== maxcmds ===<br />
'''S2A 8500:'''<br />
<br />
One customer experienced a 30% improvement in write performance by changing this value from the default of 2 to 4. This works only with SATA-based disks, and _only_ if you can guarantee that only one controller of the pair will actually be accessing the shared LUNs.<br />
<br />
This information comes with a warning, as DDN support does not recommend changing this setting from the default. When the same customer increased the value to 5, serious problems resulted.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{{{<br />
disk maxcmds=3 # default is 2<br />
}}}<br />
'''S2A 9500/9550:'''<br />
For this hardware, a value of 16 is recommended. The default value is 6. The maximum value is 32, but values above 16 are not currently recommended by DDN support.<br />
=== Illustration - one OST per Tier ===<br />
{{{<br />
Capacity Block<br />
LUN Label Owner Status (Mbytes) Size Tiers Tier list<br />
------------------------------------------------------------------<br />
0 1 Ready 512 1 1<br />
1 1 Ready 512 1 2<br />
2 1 Ready 512 1 3<br />
3 1 Ready 512 1 4<br />
4 2 Ready [GHS] 1 5<br />
5 2 Ready [GHS] 1 6<br />
6 2 Critical 512 1 7<br />
7 2 Critical 1 8<br />
10 1 Cache Locked 64 512 1 1<br />
11 1 Cache Locked 64 512 1 2<br />
12 1 Cache Locked 64 512 1 3<br />
13 1 Cache Locked 64 512 1 4<br />
14 2 Ready [GHS] 64 512 1 5<br />
15 2 Ready [GHS] 64 512 1 6<br />
16 2 Critical 64 512 1 7<br />
17 2 Critical 64 512 1 8<br />
System verify extent: 16 Mbytes<br />
System verify delay: 30<br />
<br />
}}}</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Lustre_DDN_Tuning&diff=2834Lustre DDN Tuning2007-06-05T09:17:14Z<p>Lydia: /* MF, readahead */</p>
<hr />
<div>== Introduction ==<br />
Guide to configuring DDN storage arrays for use with Lustre. For a complete DDN tuning manual, see Section 3.3 Performance Management of the [http://www.ddnsupport.com/manuals.html DDN manual] for your product.<br />
<br />
== Settings ==<br />
=== MF, readahead ===<br />
For the DDN 8500, CFS recommends disabling readahead. Consider a 1000-client system, each client with up to 8 read RPCs in flight: this means 8 * 1000 * 1MB = 8GB of reads in flight. With a DDN cache in the range of 2-5GB (depending on model), it is unlikely that the LUN-based readahead would get ANY cache hits, even if the file data were contiguous on disk (which it often isn't). The Multiplication Factor (MF) also influences the readahead and should be disabled.<br />
<br />
The DDN cli commands needed are:<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
|<br />
cache prefetch=0<br />
cache MF=off<br />
|}<br />
Evidence can be found in the 8500 sgpdd survey: attachment:ddn.xls<br />
<br />
For a DDN S2A 9500 or 9550, CFS also recommends disabling readahead using the commands above.<br />
<br />
</div>Lydiahttp://wiki.old.lustre.org/index.php?title=RAID5_Patches&diff=2734RAID5 Patches2007-05-21T08:24:25Z<p>Lydia: /* Structures */</p>
<hr />
<div>= Notes about RAID5 internals =<br />
<br />
== Structures ==<br />
<br />
In Linux, RAID5 handles all incoming requests in small units called '''stripes'''.<br />
A stripe is a set of '''blocks''' taken from all disks at the same position.<br />
A block is defined as a unit of PAGE_SIZE bytes. <br />
<br />
For example, suppose you have 3 disks and specified an 8K chunk size. Then RAID5 will look internally like the following:<br />
{|border=1 cellspacing=0<br />
|-<br />
| || S0 || S8 || S32 || S40 <br />
|-<br />
| Disk1 || #0 || #8 || #32 || #40 <br />
|-<br />
| Disk2 || #16 || #24 || #48 || #56 <br />
|-<br />
| Disk3 || P0 || P8 || P32 || P40 <br />
|}<br />
where:<br />
* Sn -- the number of the internal stripe<br />
* #n -- an offset in sectors (512 bytes)<br />
* Pn -- the parity for the other blocks in the stripe (in reality it floats among the disks)<br />
<br />
As you can see, an 8K chunk size means 2 contiguous blocks.<br />
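The layout above can be reproduced with a small sketch (Python used purely for illustration; the 1-based disk numbering and the parity disk pinned to Disk3 are simplifications taken from the table):

```python
# Map a logical 4K block number to the data disk holding it, for the
# 3-disk, 8K-chunk example above (parity pinned to Disk3 for simplicity).
CHUNK_BLOCKS = 2   # 8K chunk / 4K PAGE_SIZE
NDATA = 2          # 3 disks - 1 parity disk

def data_disk(block):
    """Return the 1-based data disk for a logical block."""
    return (block // CHUNK_BLOCKS) % NDATA + 1

# Sector offsets (#n) are just block * 8, since a 4K block is 8 sectors.
layout = [(b * 8, data_disk(b)) for b in range(8)]
print(layout)
# → [(0, 1), (8, 1), (16, 2), (24, 2), (32, 1), (40, 1), (48, 2), (56, 2)]
# i.e. #0,#8,#32,#40 on Disk1 and #16,#24,#48,#56 on Disk2, as in the table.
```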
<br />
== Logic ==<br />
<br />
''make_request()'' goes through the incoming request, breaking it into '''blocks'''<br />
(PAGE_SIZE) and handling them separately. Given a bio with bi_sector = 0 and<br />
bi_size = 24K on the array described above, ''make_request()'' would handle #0, #8 and #16.<br />
<br />
For every block, ''add_stripe_bio()'' and ''handle_stripe()'' are called.<br />
<br />
The intention of ''add_stripe_bio()'' is to add a bio to the given stripe; later, in<br />
''handle_stripe()'', we will be able to use the bio and its data for serving requests.<br />
<br />
''handle_stripe()'' is the core of RAID5; we'll discuss it in the next part.<br />
<br />
== handle_stripe() ==<br />
<br />
This routine works on a single stripe. It checks what should be done, learns the current<br />
state of the stripe in the internal cache, decides what I/O is needed to<br />
satisfy user requests, and performs recovery.<br />
<br />
Say the user wants to write block #0 (8 sectors starting from sector 0). RAID5's<br />
responsibility is to store the new data and update the parity P0. There are a few <br />
possibilities here:<br />
# delay serving until the data for block #16 is ready -- probably the user will want to write #16 very soon?<br />
# read #16, compute a new parity P0; write #0 and P0<br />
# read P0, roll the old #0 out of P0 (so it looks as if the parity had been computed without #0) and re-compute the parity with the new #0<br />
<br />
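Options 2 and 3 produce the same parity, which can be checked with a toy XOR sketch (the 4-byte "blocks" and helper names are illustrative assumptions, not kernel code):

```python
# Toy demonstration that rebuilding parity (option 2) and read-modify-write
# (option 3) agree. Real blocks are PAGE_SIZE; 4 bytes keeps it readable.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old0  = b"\x0a\x0b\x0c\x0d"   # current contents of block #0
old16 = b"\x10\x20\x30\x40"   # current contents of block #16
p0    = xor(old0, old16)      # parity invariant: P0 = #0 XOR #16

new0 = b"\xff\x00\xff\x00"    # data the user wants to write to #0
p0_rebuild = xor(new0, old16)          # option 2: read #16, rebuild parity
p0_rmw = xor(xor(p0, old0), new0)      # option 3: roll old #0 out, fold new #0 in
assert p0_rebuild == p0_rmw            # both approaches yield the same P0
```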
The 1st way looks better because it doesn't require a very expensive read, but<br />
the problem is that the user may need to write only #0, and not #16, in the near future.<br />
Also, the queue can get unplugged, meaning that the user wants all requests to<br />
complete (unfortunately, in the current block layer there is no way to specify<br />
which exact request the user is interested in, so any completion interest means<br />
immediate serving of the whole queue).<br />
<br />
== Problems ==<br />
<br />
A short list of the problems in RAID5 that we met in the Thumper project:<br />
<br />
; * order of handling isn't good for large requests<br />
As ''handle_stripe()'' goes in logical block order, it<br />
handles S0, then S8, then again S0 and S8. After the first touch,<br />
S0 is left with block #0 uptodate, while #16 and P0 are not. Thus,<br />
if the stripe is forced to completion, we'd need to read block<br />
#16 or P0 to get a fully uptodate stripe. Such reads hurt throughput<br />
almost to death. If just a single process writes, then things are<br />
OK, because nobody unplugs the queue and there are no requests to<br />
force completion of pending requests. But the more writers there are, the<br />
more often a queue unplug happens and the more often pending requests are forced<br />
to completion. Take into account that in reality we use a large<br />
chunk size (128K, 256K and even larger), hence tons of non-uptodate<br />
stripes in the cache and tons of reads in the end.<br />
<br />
; * memcpy() is top consumer<br />
All requests go via the internal cache. On a dual-core, 2-way Opteron,<br />
memcpy() takes up to 30-33% of the CPU when doing a 1GB/s write.<br />
<br />
; * small requests<br />
To fill the I/O pipes and reach good throughput we need quite large<br />
I/O requests. Lustre does this using the bio subsystem on 2.6, but,<br />
as mentioned above, RAID5 handles all blocks separately and<br />
issues a separate I/O (bio) for every block. This is partially solved<br />
by the I/O scheduler, which merges small requests into bigger ones, but<br />
due to the nature of the block subsystem, any process that wants its I/O to<br />
be completed ''unplugs'' the queue, and we can get many small requests<br />
in the pipe.<br />
<br />
We developed patches that address the problems described above. You can find<br />
them in ftp://ftp.clusterfs.com/pub/people/alex/raid5</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Lustre_Documentation&diff=2355Lustre Documentation2007-05-14T08:15:24Z<p>Lydia: /* Older Documentation */</p>
<hr />
<div>= Lustre Documentation =<br />
== Lustre Manual ==<br />
<br />
=== Lustre 1.6 ===<br />
* '''Lustre Manual - HTML '''[http://wiki.lustre.org/images/7/78/LustreManual1_6.html Lustre 1.6 manual V1.1]<br />
* '''Lustre Manual - PDF'''<br />
** [http://wiki.lustre.org/images/7/78/LustreManual1_6.pdf Lustre 1.6 manual V1.1 (A4)]<br />
** [http://wiki.lustre.org/images/7/78/LustreManual-letter1_6.pdf Lustre 1.6 manual V1.1 (Letter)]<br />
* '''Lustre Manual Changelog '''[http://wiki.lustre.org/images/7/78/manual-changelog1_6.html Lustre Manual-Change log]<br />
* '''[http://wiki.lustre.org/index.php?title=Mount_Conf MountConf wiki]''' - A quick reference for those familiar with older versions of Lustre<br />
<br />
=== Lustre 1.4.8 ===<br />
* '''Lustre Manual - HTML '''[http://wiki.lustre.org/images/7/78/LustreManual.html Lustre 1.4.8 manual V1.37]<br />
* '''Lustre Manual - PDF'''<br />
** [http://wiki.lustre.org/images/7/78/LustreManual37.pdf Lustre 1.4.8 manual V1.37 (A4)]<br />
* '''Lustre Manual Changelog '''[http://wiki.lustre.org/images/7/78/manual-changelog.html Lustre Manual-Change log]<br />
<br />
== Interim Lustre 1.8 Documentation ==<br />
* '''[http://wiki.lustre.org/index.php?title=Kerb_Lustre KerbLustre]''' - The 1.8 interim Lustre documentation (Kerberos....)<br />
<br />
== Older Documentation ==<br />
This documentation will be incorporated into the manual. Until that is done, you may find useful bits and pieces here.<br />
<br />
* '''Frequently Asked Questions'''<br />
** [http://www.clusterfs.com/faq.html Lustre Faq]<br />
* '''Knowledge Base'''<br />
** [http://bugzilla.lustre.org/showdependencytree.cgi?id=2374 Lustre Knowledge Base] - questions and answers<br />
* '''Configuration'''<br />
** [http://wiki.lustre.org/index.php?title=Lustre_Howto LustreHowto] - A guide to getting a Lustre cluster started.<br />
** [http://wiki.lustre.org/index.php?title=Lustre_LDAP LustreLDAP] - A guide to using LDAP with Lustre.<br />
** [http://wiki.lustre.org/index.php?title=Lustre_Wizard LustreWizard] - The ''Lustre wizard'' or ''lwizard'' is a utility that helps create a configuration file for a cluster by asking some simple questions.<br />
** [http://wiki.lustre.org/index.php?title=Lustre_Failover LustreFailover] - Managing failover<br />
** [http://wiki.lustre.org/index.php?title=Filesystem_Backup FilesystemBackup] - How to back up a Lustre filesystem<br />
** [http://wiki.lustre.org/index.php?title=Fsck_Suppor FsckSupport] - The Lustre fsck tool and the Lustre-patched e2fsck (extents, large inode/EA support)<br />
** '''Filesystem Tuning'''<br />
*[https://mail.clusterfs.com/wikis/attachments/LustreManual.html#Chapter_III-2._LustreProc LustreProc] - A guide to the '''proc''' tunable parameters for Lustre and their usage. It describes several of the ''proc'' tunables, including those that affect the client's RPC behaviour, and prepares for a substantial reorganization of proc entries.<br />
*'''[https://mail.clusterfs.com/wikis/attachments/LustreManual.html#3.2_DDN_Tuning LustreDdnTuning]''' - A brief guide on tuning DDN S2A 8500 (and maybe 9500) storage optimally for Lustre</div>Lydiahttp://wiki.old.lustre.org/index.php?title=RAID5_Patches&diff=2137RAID5 Patches2007-05-13T06:58:18Z<p>Lydia: /* Problems */</p>
<hr />
<div>= Notes about RAID5 internals =<br />
<br />
== Structures ==<br />
<br />
In Linux, RAID5 handles all incoming requests in small units called '''stripes'''.<br />
A stripe is a set of '''blocks''' taken from all disks at the same position.<br />
A block is a unit of PAGE_SIZE bytes. <br />
<br />
For example, suppose you have 3 disks and specified an 8K chunk size. Internally, RAID5 will then look like the following:<br />
{|border=1 cellspacing=1<br />
|-<br />
| || S0 || S8 || S32 || S40 <br />
|-<br />
| Disk1 || #0 || #8 || #32 || #40 <br />
|-<br />
| Disk2 || #16 || #24 || #48 || #56 <br />
|-<br />
| Disk3 || P0 || P8 || P32 || P40 <br />
|}<br />
where:<br />
* Sn -- the number of the internal stripe<br />
* #n -- an offset in sectors (512 bytes)<br />
* Pn -- the parity for the other blocks in the stripe (in practice it rotates among the disks)<br />
<br />
As you can see, an 8K chunk size means 2 contiguous blocks.<br />
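The arithmetic behind the table can be reproduced in a few lines of Python. This is an illustration only; the name ''map_block'' is ours, and parity is kept fixed on the last disk here, whereas real RAID5 rotates it:<br />

```python
CHUNK_SECTORS = 16   # 8K chunk / 512-byte sectors
DATA_DISKS = 2       # 3 disks, one holding parity per stripe

def map_block(sector):
    """Map a logical sector (a '#n' label from the table) to
    (stripe id, data disk index) for the 3-disk, 8K-chunk layout."""
    row_sectors = CHUNK_SECTORS * DATA_DISKS   # sectors in one chunk row
    row = sector // row_sectors                # which chunk row
    within = sector % row_sectors
    disk = within // CHUNK_SECTORS             # Disk1 = 0, Disk2 = 1
    stripe = row * row_sectors + within % CHUNK_SECTORS
    return stripe, disk

# Reproduce a column of the table: #16 and #24 live on Disk2, in S0 and S8.
print(map_block(16), map_block(24))   # (0, 1) (8, 1)
```
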
<br />
== Logic ==<br />
<br />
''make_request()'' goes through the incoming request, breaking it into '''blocks'''<br />
(PAGE_SIZE) and handling them separately. Given a bio with bi_sector = 0 and<br />
bi_size = 24K on the array described above, ''make_request()'' would handle #0, #8 and #16.<br />
<br />
For every block, ''add_stripe_bio()'' and ''handle_stripe()'' are called.<br />
<br />
The intention of ''add_stripe_bio()'' is to attach a bio to a given stripe; later, in<br />
''handle_stripe()'', we will be able to use the bio and its data to serve requests.<br />
<br />
''handle_stripe()'' is the core of raid5; we discuss it in the next part.<br />
<br />
== handle_stripe() ==<br />
<br />
The routine works on a stripe. It checks what should be done, learns the current<br />
state of the stripe in the internal cache, decides what I/O is needed to<br />
satisfy user requests, and performs recovery.<br />
<br />
Say the user wants to write block #0 (8 sectors starting from sector 0). raid5's<br />
responsibility is to store the new data and update the parity P0. There are a few <br />
possibilities here:<br />
# delay serving until the data for block #16 is ready -- perhaps the user will want to write #16 very soon?<br />
# read #16, compute a new parity P0; write #0 and P0<br />
# read P0, XOR the old #0 out of P0 (so it will look as if we computed parity without #0) and re-compute the parity with the new #0<br />
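Options 2 and 3 produce the same parity, because parity is a plain XOR over the data blocks. A small sketch of that equivalence (our illustration, using short byte strings in place of 4K blocks):<br />

```python
def xor(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def reconstruct_write(new_d0, d1):
    """Option 2: read the other data block (#16) and recompute P0."""
    return xor(new_d0, d1)

def read_modify_write(old_p, old_d0, new_d0):
    """Option 3: read old P0, XOR out the old #0, XOR in the new #0."""
    return xor(xor(old_p, old_d0), new_d0)

old_d0, d1, new_d0 = b"\x11" * 4, b"\x22" * 4, b"\x33" * 4
old_p = xor(old_d0, d1)
assert reconstruct_write(new_d0, d1) == read_modify_write(old_p, old_d0, new_d0)
```

Which option is cheaper depends on how many of the stripe's blocks are already in the cache: option 2 reads the other data blocks, option 3 reads only the old data and old parity.<br />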
<br />
The first way looks better because it doesn't require a very expensive read, but<br />
the problem is that the user may need to write only #0, and not #16, in the near future.<br />
Also, the queue can get unplugged, meaning that the user wants all requests to<br />
complete (unfortunately, in the current block layer there is no way to specify<br />
which exact request the user is interested in, so any completion interest means<br />
immediate serving of the whole queue).<br />
<br />
== Problems ==<br />
<br />
A short list of the problems we met with raid5 in the Thumper project:<br />
<br />
* order of handling isn't good for large requests<br />
As ''handle_stripe()'' goes in logical block order, it<br />
handles S0, then S8, then S0 and S8 again. After the first touch,<br />
S0 is left with block #0 uptodate, while #16 and P0 are not. Thus,<br />
if the stripe is forced to completion, we would need to read block<br />
#16 or P0 to get a fully uptodate stripe. Such reads hurt throughput<br />
almost to death. If just a single process writes, things are<br />
OK, because nobody unplugs the queue and there are no requests to<br />
force completion of pending requests. But the more writers there are, the<br />
more often the queue is unplugged and the more often pending requests are forced<br />
to completion. Take into account that in reality we use a large<br />
chunk size (128K, 256K and even larger), hence tons of non-uptodate<br />
stripes in the cache and tons of reads in the end.<br />
<br />
* memcpy() is the top consumer<br />
All requests go through the internal cache. On a dual-core, 2-way Opteron,<br />
memcpy() takes up to 30-33% of CPU while writing at 1GB/s.<br />
<br />
* small requests<br />
To fill the I/O pipes and reach good throughput, we need quite large<br />
I/O requests. Lustre achieves this using the bio subsystem on 2.6, but,<br />
as mentioned above, raid5 handles all blocks separately and<br />
issues a separate I/O (bio) for every block. This is partially solved<br />
by the I/O scheduler, which merges small requests into bigger ones, but<br />
due to the nature of the block subsystem, any process that wants an I/O to<br />
complete ''unplugs'' the queue, so we can still get many small requests<br />
in the pipe.<br />
<br />
We developed patches that address the problems described above. You can find<br />
them at ftp://ftp.clusterfs.com/pub/people/alex/raid5</div>Lydiahttp://wiki.old.lustre.org/index.php?title=RAID5_Patches&diff=2134RAID5 Patches2007-05-13T06:40:53Z<p>Lydia: /* handle_stripe() */</p>
<hr />
<div>= Notes about RAID5 internals =<br />
<br />
== Structures ==<br />
<br />
In Linux, RAID5 handles all incoming requests in small units called '''stripes'''.<br />
A stripe is a set of '''blocks''' taken from all disks at the same position.<br />
A block is a unit of PAGE_SIZE bytes. <br />
<br />
For example, suppose you have 3 disks and specified an 8K chunk size. Internally, RAID5 will then look like the following:<br />
{|border=1 cellspacing=1<br />
|-<br />
| || S0 || S8 || S32 || S40 <br />
|-<br />
| Disk1 || #0 || #8 || #32 || #40 <br />
|-<br />
| Disk2 || #16 || #24 || #48 || #56 <br />
|-<br />
| Disk3 || P0 || P8 || P32 || P40 <br />
|}<br />
where:<br />
* Sn -- the number of the internal stripe<br />
* #n -- an offset in sectors (512 bytes)<br />
* Pn -- the parity for the other blocks in the stripe (in practice it rotates among the disks)<br />
<br />
As you can see, an 8K chunk size means 2 contiguous blocks.<br />
<br />
== Logic ==<br />
<br />
''make_request()'' goes through the incoming request, breaking it into '''blocks'''<br />
(PAGE_SIZE) and handling them separately. Given a bio with bi_sector = 0 and<br />
bi_size = 24K on the array described above, ''make_request()'' would handle #0,<br />
#8 and #16.<br />
<br />
For every block, ''add_stripe_bio()'' and ''handle_stripe()'' are called.<br />
<br />
''add_stripe_bio()'' attaches a bio to the given stripe; later, in<br />
''handle_stripe()'', the bio and its data can be used to serve requests.<br />
<br />
''handle_stripe()'' is the core of RAID5; we discuss it in the next section.<br />
<br />
== handle_stripe() ==<br />
<br />
This routine works on one stripe. It checks what needs to be done, learns the current<br />
state of the stripe in the internal cache, decides what I/O is needed to<br />
satisfy user requests, and performs recovery.<br />
<br />
Say the user wants to write block #0 (8 sectors starting at sector 0). RAID5 must<br />
store the new data and update parity P0. There are a few<br />
possibilities here:<br />
# delay serving until the data for block #16 is ready -- the user will probably want to write #16 very soon<br />
# read #16, compute a new parity P0, then write #0 and P0<br />
# read P0, XOR the old #0 back out of it (so the parity looks as if it had been computed without #0), then recompute the parity with the new #0<br />
<br />
The first option looks best because it avoids a very expensive read, but<br />
the problem is that the user may need to write only #0 and not #16 in the near future.<br />
Also, the queue can get unplugged, meaning the user wants all requests to<br />
complete (unfortunately, the current block layer has no way to specify<br />
which exact request the user is interested in, so any interest in completion means<br />
immediate serving of the whole queue).<br />
<br />
== Problems ==<br />
<br />
A short list of the problems in RAID5 that we met in the Thumper project:<br />
<br />
* The order of handling isn't good for large requests<br />
As ''handle_stripe()'' goes in logical block order, it<br />
handles S0, then S8, then S0 and S8 again. After the first touch,<br />
S0 is left with block #0 uptodate, while #16 and P0 are not. Thus,<br />
if the stripe is forced to complete, we need to read block<br />
#16 or P0 to get a fully uptodate stripe. Such reads hurt throughput<br />
almost to death. If just a single process writes, then things are<br />
OK, because nobody unplugs the queue and there are no requests to<br />
force completion of pending requests. But the more writers there are, the<br />
more often the queue is unplugged and the more often pending requests are forced<br />
to complete. Take into account that in reality we use large<br />
chunk sizes (128K, 256K and even larger), hence tons of non-uptodate<br />
stripes in the cache and tons of reads in the end.<br />
<br />
* memcpy() is the top consumer<br />
All requests go via the internal cache. On a dual-core 2-way Opteron,<br />
memcpy() takes up to 30-33% of the CPU while writing at 1 GB/s.<br />
<br />
* Small requests<br />
To fill the I/O pipes and reach good throughput we need quite large<br />
I/O requests. Lustre achieves this using the bio subsystem on 2.6, but,<br />
as mentioned above, RAID5 handles all blocks separately and<br />
issues a separate I/O (bio) for every block. This is partially solved<br />
by the I/O scheduler, which merges small requests into bigger ones, but<br />
due to the nature of the block subsystem, any process that wants its I/O<br />
to complete ''unplugs'' the queue, and we can end up with many small requests<br />
in the pipe.<br />
<br />
We developed patches that address the problems described above. You can find<br />
them in ftp://ftp.clusterfs.com/pub/people/alex/raid5</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Netconsole&diff=2051Netconsole2007-05-11T10:56:18Z<p>Lydia: /* Netconsole with UDP */</p>
<hr />
<div>= Netconsole with UDP =<br />
<br />
<br />
* Configure netconsole for a 2.4 Red Hat kernel (Red Hat 9.0)<br />
** 1) build the 2.4 Red Hat kernel with the netconsole patch (in the rh-2.4 series); make sure it is built with the NETCONSOLE option.<br />
** 2) boot the netconsole machine with this kernel; make sure the netdump rpm is installed on this machine.<br />
** 3) edit /etc/sysconfig/netdump on this machine, for example:<br />
*** LOCALPORT=6666<br />
*** DEV=eth0<br />
*** NETDUMPADDR=192.168.1.223 #remote server machine (host)<br />
*** NETDUMPPORT=6666<br />
*** NETDUMPMACADDR=00:08:74:96:6D:9B #remote server eth address<br />
*** IDLETIMEOUT=100<br />
** 4) add a user account named netdump on your host machine, then start the netdump service on the netconsole machine.<br />
** 5) netconsole is now set up.<br />
* Configure netconsole in a 2.6 kernel<br />
** 1) add the line "netconsole=[src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]" to your boot command line.<br />
** 2) boot the netconsole machine.<br />
** 3) 2.6 netconsole is now set up.<br />
* Configure conman over netconsole; download conman from ftp://www.clusterfs.com/pub/conman<br />
** 1) edit your /etc/conman.conf; make sure you have set SERVER logdir, SERVER logfile and SERVER port,<br />
*** then add the following lines to /etc/conman.conf<br />
*** NETCONSOLE name="netconsole" dev="client1_ip_address:client1_port" (clientX is a netconsole machine)<br />
*** NETCONSOLE1 name="netconsole" dev="client2_ip_address:client2_port"<br />
*** ............<br />
** 2) start the conmand server: conmand<br />
** 3) start a conman netconsole with: conman -d conmand_server_ip:conmand_server_port NETCONSOLEx<br />
** 4) input a sysrq command (& + S + sysrq_command) over the conman netconsole; the result appears in the conman netconsole<br />
<br />
<br />
<br />
There is a ''netconsole patch'' available that supports kernel-level network logging over UDP. More information and a link to the kernel patches can be found at: [http://lwn.net/2001/0927/a/netconsole.php3].<br />
<br />
----<br />
*'''LinuxDebugging'''</div>Lydiahttp://wiki.old.lustre.org/index.php?title=Guidelines_for_Setting_Up_a_Cluster&diff=1952Guidelines for Setting Up a Cluster2007-05-10T14:22:11Z<p>Lydia: </p>
<hr />
<div>Some tips we've collected while working on clusters that can lead to a more useful debugging experience.<br />
<br />
# '''Shared home directories''' <br/>Having a shared namespace comes in handy all the time. It's useful for bringing up Lustre builds, collecting logs, pushing out configuration files, etc. Sharing /home is the least surprising choice.<br />
# '''PDSH '''<br/>pdsh is an absolute requirement. Bonus points for being able to pdsh to all nodes from any node.<br />
# '''Regular naming'''<br/>A node naming scheme that involves a short prefix and regular incrementing decimal node numbers combines very well with automation like pdsh. As machines tend to take on different roles as different people use the cluster, it doesn't make a lot of sense to give hostnames based on roles in the lustre universe (mds, ost, etc).<br />
# '''Serial Consoles'''<br/>As in any data center, they're essential. Log their output for later retrieval should the kernel go wrong. Provide a useful front end like 'conman' or 'conserver'. Make sure the front-end can send breaks to the kernel's sysrq facility over the serial console.<br />
# '''Collect syslogs in one place'''<br/>It's nice to be able to watch one log for errors that are reported to syslog across the cluster.<br />
# '''Remote Power Management'''<br/>If a machine wedges, one needs to be able to reboot it without physically flipping a switch. Any number of vendors offer serially controlled power widgets.<br />
# '''Automated Disaster Recovery'''<br/>It's nice to be able to reimage a node via netbooting and network software installs. It's a low-frequency endeavour, though.<br />
# '''Boot Quickly'''<br />
## Disable non-essential services to be started at boot-time<br />
## Minimize hardware checks the BIOS may do<br />
## Especially avoid things like RH's Kudzu which can ask for user input before proceeding<br />
<br />
----<br />
* '''FrontPage'''</div>Lydia