<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://wiki.old.lustre.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Nathan</id>
	<title>Obsolete Lustre Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://wiki.old.lustre.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Nathan"/>
	<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Special:Contributions/Nathan"/>
	<updated>2026-04-12T21:13:50Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.7</generator>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Configuring_Lustre_File_Striping&amp;diff=12325</id>
		<title>Configuring Lustre File Striping</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Configuring_Lustre_File_Striping&amp;diff=12325"/>
		<updated>2012-02-02T00:10:50Z</updated>

		<summary type="html">&lt;p&gt;Nathan: -c and -i options were swapped&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;small&amp;gt;&#039;&#039;(Updated: Oct 2009)&#039;&#039;&amp;lt;/small&amp;gt;&lt;br /&gt;
__TOC__&lt;br /&gt;
One of the main factors leading to the high performance of Lustre™ file systems is the ability to stripe data over multiple OSTs. The stripe count can be set on a file system, directory, or file level.  An example showing the use of striping is provided below. &lt;br /&gt;
&lt;br /&gt;
For additional information, see [http://wiki.lustre.org/manual/LustreManual20_HTML/ManagingStripingFreeSpace.html#50438209_pgfId-5529 Chapter 18: &#039;&#039;Managing File Striping and Free Space&#039;&#039;] in the [http://wiki.lustre.org/manual/LustreManual20_HTML/index.html &#039;&#039;Lustre Operations Manual&#039;&#039;]&lt;br /&gt;
&lt;br /&gt;
== Setting Up Striping ==&lt;br /&gt;
&lt;br /&gt;
To see the current stripe settings, use the command &#039;&#039;lfs getstripe [file, dir, fs]&#039;&#039;. This command produces output similar to the following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[root@LustreClient01 lustre]# lfs getstripe /mnt/lustre&lt;br /&gt;
OBDS:&lt;br /&gt;
0: lustre-OST0000_UUID ACTIVE&lt;br /&gt;
1: lustre-OST0001_UUID ACTIVE&lt;br /&gt;
2: lustre-OST0002_UUID ACTIVE&lt;br /&gt;
3: lustre-OST0003_UUID ACTIVE&lt;br /&gt;
4: lustre-OST0004_UUID ACTIVE&lt;br /&gt;
5: lustre-OST0005_UUID ACTIVE&lt;br /&gt;
/mnt/lustre&lt;br /&gt;
(Default) stripe_count: 2 stripe_size: 4M stripe_offset: 0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the default stripe count is 2 (that is, data blocks are striped over two OSTs), the default stripe size is 4 MB (the stripe size can be set in K, M or G), and all writes start from the first OST.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Note:&#039;&#039;&#039;&#039;&#039; When setting the stripe, the offset (&#039;&#039;-i&#039;&#039;) is specified before the stripe count (&#039;&#039;-c&#039;&#039;).&lt;br /&gt;
&lt;br /&gt;
The command to set a new stripe pattern on the file system may look like this:&lt;br /&gt;
&lt;br /&gt;
 [root@LustreClient01 lustre]# lfs setstripe -s 4M -i 0 -c 1 /mnt/lustre&lt;br /&gt;
&lt;br /&gt;
This command sets the striping of &#039;&#039;/mnt/lustre&#039;&#039; to 4 MB stripes, starting at OST0 and spanning a single OST. If a new file is created with these settings, the following results are seen:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[root@LustreClient01 lustre]# dd if=/dev/zero of=/mnt/lustre/test1 bs=10M count=100&lt;br /&gt;
&lt;br /&gt;
[root@LustreClient01 lustre]# lfs df -h&lt;br /&gt;
UUID                  bytes     Used  Available   Use%   Mounted on&lt;br /&gt;
lustre-MDT0000_UUID    4.4G   214.5M       3.9G     4%   /mnt/lustre[MDT:0]&lt;br /&gt;
lustre-OST0000_UUID    2.0G     1.1G     830.1M    53%   /mnt/lustre[OST:0]&lt;br /&gt;
lustre-OST0001_UUID    2.0G    83.3M       1.8G     4%   /mnt/lustre[OST:1]&lt;br /&gt;
lustre-OST0002_UUID    2.0G    83.3M       1.8G     4%   /mnt/lustre[OST:2]&lt;br /&gt;
lustre-OST0003_UUID    2.0G    83.3M       1.8G     4%   /mnt/lustre[OST:3]&lt;br /&gt;
lustre-OST0004_UUID    2.0G    83.3M       1.8G     4%   /mnt/lustre[OST:4]&lt;br /&gt;
lustre-OST0005_UUID    2.0G    83.3M       1.8G     4%   /mnt/lustre[OST:5]&lt;br /&gt;
&lt;br /&gt;
filesystem summary:   11.8G     1.5G       9.7G    12%   /mnt/lustre&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In this example, the entire file was written to the first OST, resulting in a very uneven distribution of data blocks across the OSTs.&lt;br /&gt;
&lt;br /&gt;
Continuing with this example, the file is removed and the stripe count is changed to a value of &#039;&#039;-1&#039;&#039; to specify striping over all available OSTs:&lt;br /&gt;
&lt;br /&gt;
 [root@LustreClient01 lustre]# lfs setstripe -s 4M -i 0 -c -1 /mnt/lustre&lt;br /&gt;
&lt;br /&gt;
Now, when a file is created, the new stripe setting evenly distributes the data over all the available OSTs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[root@LustreClient01 lustre]# dd if=/dev/zero of=/mnt/lustre/test1 bs=10M count=100&lt;br /&gt;
100+0 records in&lt;br /&gt;
100+0 records out&lt;br /&gt;
1048576000 bytes (1.0 GB) copied, 20.2589 seconds, 51.8 MB/s&lt;br /&gt;
&lt;br /&gt;
[root@LustreClient01 lustre]# lfs df -h&lt;br /&gt;
UUID                  bytes     Used  Available   Use%   Mounted on&lt;br /&gt;
lustre-MDT0000_UUID    4.4G   214.5M       3.9G     4%  /mnt/lustre[MDT:0]&lt;br /&gt;
lustre-OST0000_UUID    2.0G   251.3M       1.6G    12%  /mnt/lustre[OST:0]&lt;br /&gt;
lustre-OST0001_UUID    2.0G   251.3M       1.6G    12%  /mnt/lustre[OST:1]&lt;br /&gt;
lustre-OST0002_UUID    2.0G   251.3M       1.6G    12%  /mnt/lustre[OST:2]&lt;br /&gt;
lustre-OST0003_UUID    2.0G   251.3M       1.6G    12%  /mnt/lustre[OST:3]&lt;br /&gt;
lustre-OST0004_UUID    2.0G   247.3M       1.6G    12%  /mnt/lustre[OST:4]&lt;br /&gt;
lustre-OST0005_UUID    2.0G   247.3M       1.6G    12%  /mnt/lustre[OST:5]&lt;br /&gt;
&lt;br /&gt;
filesystem summary:   11.8G     1.5G       9.7G    12%  /mnt/lustre&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Displaying Stripe Information for a File ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;lfs getstripe&#039;&#039; command can be used to display information that shows over which OSTs a file is distributed. For example, the output from the following command (showing multiple &#039;&#039;obdidx&#039;&#039; entries) indicates that the file &#039;&#039;test1&#039;&#039; is striped over all six active OSTs in the configuration:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[root@LustreClient01 ~]# lfs getstripe /mnt/lustre/test1&lt;br /&gt;
OBDS:&lt;br /&gt;
0: lustre-OST0000_UUID ACTIVE&lt;br /&gt;
1: lustre-OST0001_UUID ACTIVE&lt;br /&gt;
2: lustre-OST0002_UUID ACTIVE&lt;br /&gt;
3: lustre-OST0003_UUID ACTIVE&lt;br /&gt;
4: lustre-OST0004_UUID ACTIVE&lt;br /&gt;
5: lustre-OST0005_UUID ACTIVE&lt;br /&gt;
/mnt/lustre/test1&lt;br /&gt;
     obdidx      objid     objid      group&lt;br /&gt;
          0          8       0x8          0&lt;br /&gt;
          1          4       0x4          0&lt;br /&gt;
          2          5       0x5          0&lt;br /&gt;
          3          5       0x5          0&lt;br /&gt;
          4          4       0x4          0&lt;br /&gt;
          5          2       0x2          0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In contrast, the output from the following command, which lists just a single &#039;&#039;obdidx&#039;&#039; entry, indicates that the file &#039;&#039;test_2&#039;&#039; is contained on a single OST:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[root@LustreClient01 ~]# lfs getstripe /mnt/lustre/test_2&lt;br /&gt;
OBDS:&lt;br /&gt;
0: lustre-OST0000_UUID ACTIVE&lt;br /&gt;
1: lustre-OST0001_UUID ACTIVE&lt;br /&gt;
2: lustre-OST0002_UUID ACTIVE&lt;br /&gt;
3: lustre-OST0003_UUID ACTIVE&lt;br /&gt;
4: lustre-OST0004_UUID ACTIVE&lt;br /&gt;
5: lustre-OST0005_UUID ACTIVE&lt;br /&gt;
/mnt/lustre/test_2&lt;br /&gt;
   obdidx      objid     objid      group&lt;br /&gt;
        2          8       0x8          0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Subsystem_Map&amp;diff=9067</id>
		<title>Subsystem Map</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Subsystem_Map&amp;diff=9067"/>
		<updated>2009-12-15T22:36:58Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* build */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Lustre subsystems are listed below. For each subsystem, a summary description and the location of its code are provided.&lt;br /&gt;
&lt;br /&gt;
==libcfs==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Libcfs provides an API comprising fundamental primitives and subsystems - e.g. process management and debugging support - that is used throughout LNET, Lustre, and associated utilities. This API defines a portable runtime environment that is implemented consistently on all supported build targets.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lnet/libcfs/**/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==lnet==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
See the [http://www.sun.com/software/products/lustre/docs/Lustre-networking.pdf Lustre Networking] white paper for details. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lnet/**/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==ptlrpc==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Ptlrpc implements Lustre communications over LNET.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
All communication between Lustre processes is handled by RPCs, in which a request is sent to an advertised service, and the service processes the request and returns a reply. Note that a service may be offered by any Lustre process - e.g. the OST service on an OSS processes I/O requests and the AST service on a client processes notifications of lock conflicts.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The initial request message of an RPC is special - it is received into the first available request buffer at the destination. All other communications involved in an RPC are like RDMAs - the peer targets them specifically. For example, in a bulk read, the OSC posts reply and bulk buffers and sends descriptors for them (the LNET matchbits used to post them) in the RPC request. After the server has received the request, it GETs or PUTs the bulk data and PUTs the RPC reply directly.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Ptlrpc ensures all resources involved in an RPC are freed in finite time. If the RPC does not complete within a timeout, all buffers associated with the RPC must be unlinked. These buffers are still accessible to the network until their completion events have been delivered.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ptlrpc/*.[ch]&lt;br /&gt;
 lustre/ldlm/ldlm_lib.c&lt;br /&gt;
&lt;br /&gt;
==llog==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
LLog is the generic logging mechanism in Lustre. It allows Lustre to store records in an appropriate format and access them later using a reasonable API.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
LLog is used in various cases. The main LLog use cases are the following:&lt;br /&gt;
&lt;br /&gt;
* mountconf - entire cluster configuration is stored on the MGS in a special configuration llog. A client may access it via an llog API working over ptlrpc; &lt;br /&gt;
&lt;br /&gt;
* MDS_OST llog - contains records for unlink and setattr operations performed on the MDS in the last, not yet committed, transaction. This is needed to preserve consistency between MDS and OST nodes in failure cases. The general rule: if the MDS has no inode for some file, then the OST should not have an object for that file either. If the OST fails in the middle of an unlink and loses the last transaction containing the unlink of the OST object, the object may survive on the OST even though the MDS has committed its transaction and no longer has an inode for the file. The file can then never be accessed again, and the object just eats up space on the OST. The solution is to maintain an unlink log on the MDS and process it at MDS-OST connect time to make sure the OST has all such objects unlinked; &lt;br /&gt;
&lt;br /&gt;
* Size llog - not yet used; it is planned to log object size changes on the OST so that the MDS can later check whether its view of the object size is coherent with the OST (the SOM case); &lt;br /&gt;
&lt;br /&gt;
* LOVEA llog - logs LOV EA merges for the join-file feature. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;General design&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Each llog type has two main parts:&lt;br /&gt;
&lt;br /&gt;
* ORIG llog - &amp;quot;server&amp;quot; part, the site where llog records are stored. It provides an API for local and/or network llog access (read, modify). Examples of ORIG logs: MDS is orig for MDS_OST llog and MGS is orig for config logs; &lt;br /&gt;
&lt;br /&gt;
* REPL llog - &amp;quot;client&amp;quot; part, the site where llog records may be used. Examples of REPL logs: OST is repl for MDS_OST llog and MGC is repl for config logs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 obdclass/llog.c&lt;br /&gt;
 obdclass/llog_cat.c&lt;br /&gt;
 obdclass/llog_lvfs.c&lt;br /&gt;
 obdclass/llog_obd.c&lt;br /&gt;
 obdclass/llog_swab.c&lt;br /&gt;
 obdclass/llog_test.c&lt;br /&gt;
 lov/lov_log.c&lt;br /&gt;
 ptlrpc/llog_client.c&lt;br /&gt;
 ptlrpc/llog_server.c&lt;br /&gt;
 ptlrpc/llog_net.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For more information, see [[Logging API]].&lt;br /&gt;
&lt;br /&gt;
==obdclass==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The obdclass code is generic Lustre configuration and device handling. Different functional parts of the Lustre code are split into obd devices which can be configured and connected in various ways to form a server or client filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Several examples of obd devices include:&lt;br /&gt;
&lt;br /&gt;
* OSC - object storage client (connects over network to OST)&lt;br /&gt;
* OST - object storage target&lt;br /&gt;
* LOV - logical object volume (aggregates multiple OSCs into a single virtual device)&lt;br /&gt;
* MDC - meta data client (connects over network to MDT)&lt;br /&gt;
* MDT - meta data target &lt;br /&gt;
&lt;br /&gt;
The obdclass code provides services used by all Lustre devices for configuration, memory allocation, generic hashing, kernel interface routines, random number generation, etc. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/obdclass/class_hash.c        - scalable hash code for imports&lt;br /&gt;
 lustre/obdclass/class_obd.c         - base device handling code&lt;br /&gt;
 lustre/obdclass/debug.c             - helper routines for dumping data structs&lt;br /&gt;
 lustre/obdclass/genops.c            - device allocation/configuration/connection&lt;br /&gt;
 lustre/obdclass/linux-module.c      - linux kernel module handling&lt;br /&gt;
 lustre/obdclass/linux-obdo.c        - pack/unpack obdo and other IO structs&lt;br /&gt;
 lustre/obdclass/linux-sysctl.c      - /proc/sys configuration parameters &lt;br /&gt;
 lustre/obdclass/lprocfs_status.c    - /proc/fs/lustre configuration/stats, helpers&lt;br /&gt;
 lustre/obdclass/lustre_handles.c    - wire opaque pointer handlers&lt;br /&gt;
 lustre/obdclass/lustre_peer.c       - peer target identification by UUID&lt;br /&gt;
 lustre/obdclass/obd_config.c        - configuration file parsing&lt;br /&gt;
 lustre/obdclass/obd_mount.c         - server filesystem mounting&lt;br /&gt;
 lustre/obdclass/obdo.c              - more obdo handling helpers&lt;br /&gt;
 lustre/obdclass/statfs_pack.c       - statfs helpers for wire pack/unpack&lt;br /&gt;
 lustre/obdclass/uuid.c              - UUID pack/unpack&lt;br /&gt;
 lustre/lvfs/lvfs_common.c           - kernel interface helpers&lt;br /&gt;
 lustre/lvfs/lvfs_darwin.c           - darwin kernel helper routines&lt;br /&gt;
 lustre/lvfs/lvfs_internal.h         - lvfs internal function prototypes&lt;br /&gt;
 lustre/lvfs/lvfs_lib.c              - statistics&lt;br /&gt;
 lustre/lvfs/lvfs_linux.c            - linux kernel helper routines&lt;br /&gt;
 lustre/lvfs/lvfs_userfs.c           - userspace helper routines&lt;br /&gt;
 lustre/lvfs/prng.c                  - long period pseudo-random number generator&lt;br /&gt;
 lustre/lvfs/upcall_cache.c          - supplementary group upcall for MDS&lt;br /&gt;
&lt;br /&gt;
==luclass==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
luclass is a body of data-type definitions and functions implementing support for layered objects - entities in which every layer of the Lustre device stack (both data and metadata, and both client and server side) can maintain its own private state and modify the behavior of the compound object in a systematic way.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Specifically, data-types are introduced representing a device type (struct lu_device_type, a layer in the Lustre stack), a device (struct lu_device, a specific instance of the type), and an object (struct lu_object). The following lu_object functionality is implemented by generic code:&lt;br /&gt;
&lt;br /&gt;
* A compound object is uniquely identified by a FID and is stored in a hash table indexed by that FID; &lt;br /&gt;
&lt;br /&gt;
* Objects are kept in an LRU list, and a method is provided to purge the least recently accessed objects in reaction to memory pressure; &lt;br /&gt;
&lt;br /&gt;
* Objects are reference counted, and cached; &lt;br /&gt;
&lt;br /&gt;
* Every object has a list of &#039;&#039;layers&#039;&#039; (also known as slices), in which devices can store their private state. Every slice also comes with a pointer to an operations vector, allowing a device to modify the object&#039;s behavior. &lt;br /&gt;
&lt;br /&gt;
In addition to objects and devices, luclass includes lu_context, a mechanism for allocating space efficiently without consuming stack space.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
luclass design is specified in the [http://arch.lustre.org/images/a/aa/Md-api-dld.pdf MD API] DLD.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 include/lu_object.h&lt;br /&gt;
 obdclass/lu_object.c&lt;br /&gt;
&lt;br /&gt;
==ldlm==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Lustre Distributed Lock Manager (LDLM) is the Lustre locking infrastructure; it handles locks between clients and servers as well as locks local to a node. Different kinds of locks are available, with different properties. For historical reasons, ldlm also contains some of the generic connection service code (both server and client).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 interval_tree.c           - This is used by extent locks to maintain interval trees (bug 11300)&lt;br /&gt;
 l_lock.c                  - Resource locking primitives. &lt;br /&gt;
 ldlm_extent.c             - Extents locking code used for locking regions inside objects&lt;br /&gt;
 ldlm_flock.c              - BSD-style (flock) and POSIX (fcntl) locking lock types&lt;br /&gt;
 ldlm_inodebits.c          - Inodebits locks used for metadata locking&lt;br /&gt;
 ldlm_lib.c                - Target and client connecting/reconnecting/recovery code.&lt;br /&gt;
                             Does not really belong to ldlm, but is historically placed &lt;br /&gt;
                             there. Should be in ptlrpc instead.&lt;br /&gt;
 ldlm_lock.c               - Functions dealing mostly with struct ldlm_lock&lt;br /&gt;
 ldlm_lockd.c              - Functions that handle incoming lock-related RPCs &lt;br /&gt;
                             (on both the server (lock enq/cancel/...) and the client &lt;br /&gt;
                             (AST handling)).&lt;br /&gt;
 ldlm_plain.c              - Plain locks, predecessor to inodebits locks; not widely used now.&lt;br /&gt;
 ldlm_pool.c               - Pools of locks, related to dynamic lrus and freeing locks on demand.&lt;br /&gt;
 ldlm_request.c            - Collection of functions to work with locks based handles as opposed &lt;br /&gt;
                             to lock structures themselves.&lt;br /&gt;
 ldlm_resource.c           - Functions operating on namespaces and lock resources.&lt;br /&gt;
 include/lustre_dlm.h      - Important defines and declarations for ldlm.&lt;br /&gt;
&lt;br /&gt;
==fids==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
FID is the unique object identifier in the cluster since Lustre 1.7. Its main properties are the following:&lt;br /&gt;
&lt;br /&gt;
* A FID is a unique, never-reused object identifier;&lt;br /&gt;
* FIDs are allocated by the client inside a sequence granted by the server;&lt;br /&gt;
* The FID is the basis of the ldlm resource used for issuing ldlm locks; because FIDs are unique, they are well suited for this use;&lt;br /&gt;
* FIDs are the basis for building client-side inode numbers, as server inode+generation can no longer be used - in CMD that combination is not unique;&lt;br /&gt;
* A FID contains no store information, such as an inode number or generation, and is therefore easy to migrate; &lt;br /&gt;
&lt;br /&gt;
FID consists of 3 fields:&lt;br /&gt;
&lt;br /&gt;
* f_seq - sequence number&lt;br /&gt;
* f_oid - object identifier inside sequence&lt;br /&gt;
* f_ver - object version &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 fid/fid_request.c&lt;br /&gt;
 fid/fid_lib.c&lt;br /&gt;
 fld/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==seq==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sequence management is a basic mechanism in the new MDS server, related to managing FIDs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A FID is a unique object identifier in Lustre, starting from version 1.7. All FIDs are organized into sequences; a sequence is a range of FIDs. Sequences are granted/allocated to clients by servers, and FIDs are allocated by clients inside the granted sequence. All FIDs inside one sequence live on the same MDS server and as such form one &amp;quot;migration unit&amp;quot; and one &amp;quot;indexing unit&amp;quot;, meaning that the FLD (FID Location Database) indexes them all by their sequence and thus has only one mapping entry for all FIDs in a sequence. See the FIDs section above for more information on FIDs and the FLD service.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A sequence has a limit on the number of FIDs that may be allocated in it. When this limit is reached, a new sequence is allocated. Upon disconnect, the server allocates a new sequence to the client when it comes back; the previously used sequence is abandoned even if it was not exhausted. Sequences are a valuable resource, but in recovery cases using a new sequence makes things easier, and it also allows FIDs and objects to be grouped by working session: new connection, new sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code description&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Server side code is divided into two parts:&lt;br /&gt;
&lt;br /&gt;
* Sequence controller - allocates super-sequences, that is, sequences of sequences, to all servers in the cluster (currently only to MDSes, as only they are aware of the new FIDs). Usually the first MDS in the cluster is the sequence controller. &lt;br /&gt;
&lt;br /&gt;
* Sequence manager - allocates meta-sequences (smaller range of sequences inside a super-sequence) to all clients, using granted super-sequence from the sequence controller. All MDSs in the cluster (all servers in the future) are sequence managers. The first MDS is, simultaneously, a sequence controller and a sequence manager.&lt;br /&gt;
&lt;br /&gt;
Client-side code allocates new sequences from the granted meta-sequence. When the meta-sequence is exhausted, a new one is allocated on the server and sent to the client.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The client code consists of an API for working with both server-side parts, not only the sequence manager; since all servers need to talk to the sequence controller, they also use the client API for this.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One important part of the client API is FID allocation. A new FID is allocated from the currently granted sequence until the sequence is exhausted. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 fid/fid_handler.c    - server side sequence management code;&lt;br /&gt;
 fid/fid_request.c    - client side sequence management code;&lt;br /&gt;
 fid/fid_lib.c        - fids related miscellaneous stuff.&lt;br /&gt;
&lt;br /&gt;
==mountconf==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MountConf is how servers and clients are set up, started, and configured. A MountConf usage document is [http://wiki.lustre.org/index.php?title=Mount_Conf here].&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The major subsystems are the MGS, MGC, and the userspace tools mount.lustre and mkfs.lustre.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The basic idea is:&lt;br /&gt;
&lt;br /&gt;
# Whenever any Lustre component is mount(2)ed, we start a MGC.&lt;br /&gt;
# This establishes a connection to the MGS and downloads a configuration llog.&lt;br /&gt;
# The MGC passes the configuration log through the parser to set up the other OBDs.&lt;br /&gt;
# The MGC holds a CR configuration lock, which the MGS recalls whenever a live configuration change is made. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MountConf file areas:&lt;br /&gt;
&lt;br /&gt;
 lustre/mgs/*&lt;br /&gt;
 lustre/mgc/*&lt;br /&gt;
 lustre/obdclass/obd_mount.c&lt;br /&gt;
 lustre/utils/mount_lustre.c&lt;br /&gt;
 lustre/utils/mkfs_lustre.c&lt;br /&gt;
&lt;br /&gt;
==liblustre==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Liblustre is a userspace library, used along with libsysio (developed by Sandia), that allows applications to use Lustre simply by being linked (or LD_PRELOAD&#039;ed) with it. Liblustre does not require any kernel support. It is also used on Cray XT3 machines (old ones, and not so old in the case of Sandia), where all applications are simply linked with the library and loaded into memory as the only code to run. Liblustre does not support async operations of any kind, due to the lack of interrupts and other notifiers from the lower levels up to Lustre. Liblustre includes another set of LNDs that are able to work from userspace.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 dir.c          - Directory operations&lt;br /&gt;
 file.c         - File handling operations (like open)&lt;br /&gt;
 llite_lib.c    - General support (init/cleanup/option parsing)&lt;br /&gt;
 lutil.c        - Supplementary code to get IP addresses and init various structures &lt;br /&gt;
                  needed to emulate the normal Linux process from other layers&#039; perspective.&lt;br /&gt;
 namei.c        - Metadata operations code.&lt;br /&gt;
 rw.c           - I/O code, including read/write&lt;br /&gt;
 super.c        - &amp;quot;Superblock&amp;quot; operation - mounting/umounting, inode operations.&lt;br /&gt;
 tests          - directory with liblustre-specific tests.&lt;br /&gt;
&lt;br /&gt;
==echo client/server==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The echo_client and obdecho are OBD devices which help testing and performance measurement.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
They were implemented originally for network testing - obdecho can replace obdfilter and echo_client can exercise any downstream configurations. They are normally used in the following configurations:&lt;br /&gt;
&lt;br /&gt;
* echo_client -&amp;gt; obdfilter. This is used to measure raw backend performance without any network I/O. &lt;br /&gt;
* echo_client -&amp;gt; OSC -&amp;gt; &amp;lt;network&amp;gt; -&amp;gt; OST -&amp;gt; obdecho. This is used to measure network and ptlrpc performance. &lt;br /&gt;
* echo_client -&amp;gt; OSC -&amp;gt; &amp;lt;network&amp;gt; -&amp;gt; OST -&amp;gt; obdfilter. This is used to measure performance available to the Lustre client. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/obdecho/&lt;br /&gt;
&lt;br /&gt;
==client vfs==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The client VFS interface, also called &#039;&#039;&#039;llite&#039;&#039;&#039;, is the bridge between the Linux kernel and the underlying Lustre infrastructure represented by the [https://wikis.clusterfs.com/intra/index.php/Lov_summary LOV], [https://wikis.clusterfs.com/intra/index.php/Client_metadata_summary MDC], and [https://wikis.clusterfs.com/intra/index.php/Ldlm_summary LDLM] subsystems. This includes mounting the client filesystem, handling name lookups, starting file I/O, and handling file permissions.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Linux VFS interface has a lot in common with the liblustre interface, which is used in the Catamount environment; as yet, the code for these two subsystems is not shared and contains a lot of duplication.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/llite/dcache.c            - Interface with Linux dentry cache/intents&lt;br /&gt;
 lustre/llite/dir.c               - readdir handling, filetype in dir, dir ioctl&lt;br /&gt;
 lustre/llite/file.c              - File handles, file ioctl, DLM extent locks&lt;br /&gt;
 lustre/llite/llite_close.c       - File close for opencache&lt;br /&gt;
 lustre/llite/llite_internal.h    - Llite internal function prototypes, structures&lt;br /&gt;
 lustre/llite/llite_lib.c         - Majority of request handling, client mount&lt;br /&gt;
 lustre/llite/llite_mmap.c        - Memory-mapped I/O&lt;br /&gt;
 lustre/llite/llite_nfs.c         - NFS export from clients&lt;br /&gt;
 lustre/llite/lloop.c             - Loop-like block device export from object&lt;br /&gt;
 lustre/llite/lproc_llite.c       - /proc interface for tunables, statistics&lt;br /&gt;
 lustre/llite/namei.c             - Filename lookup, intent handling&lt;br /&gt;
 lustre/llite/rw24.c              - Linux 2.4 IO handling routines&lt;br /&gt;
 lustre/llite/rw26.c              - Linux 2.6 IO handling routines&lt;br /&gt;
 lustre/llite/rw.c                - Linux generic IO handling routines&lt;br /&gt;
 lustre/llite/statahead.c         - Directory statahead for &amp;quot;ls -l&amp;quot; and &amp;quot;rm -r&amp;quot;&lt;br /&gt;
 lustre/llite/super25.c           - Linux 2.6 VFS file method registration&lt;br /&gt;
 lustre/llite/super.c             - Linux 2.4 VFS file method registration&lt;br /&gt;
 lustre/llite/symlink.c           - Symbolic links&lt;br /&gt;
 lustre/llite/xattr.c             - User-extended attributes&lt;br /&gt;
&lt;br /&gt;
==client vm==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client code interacts with VM/MM subsystems of the host OS kernel to cache data (in the form of pages), and to react to various memory-related events, like memory pressure.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Two key components of this interaction are:&lt;br /&gt;
&lt;br /&gt;
* cfs_page_t: a data type representing an MM page. It comes with an interface to map/unmap a page to/from the kernel virtual address space, to access various per-page bits (&#039;dirty&#039;, &#039;uptodate&#039;, etc.), and to lock/unlock a page. Currently, this data type closely matches the Linux kernel page; it still has to be straightened out, formalized, and expanded to include functionality such as querying the total number of pages on a node. &lt;br /&gt;
* MM page operations in cl_page (part of the new client I/O interface). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This describes the &#039;&#039;next generation&#039;&#039; Lustre client I/O code, which is expected to appear in Lustre 2.0. Code location is not finalized.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cfs_page_t interface is defined and implemented in:&lt;br /&gt;
&lt;br /&gt;
 lnet/include/libcfs/ARCH/ARCH-mem.h&lt;br /&gt;
 lnet/libcfs/ARCH/ARCH-mem.c &lt;br /&gt;
&lt;br /&gt;
The generic part of cl_page will be located in:&lt;br /&gt;
&lt;br /&gt;
 include/cl_object.h&lt;br /&gt;
 obdclass/cl_page.c&lt;br /&gt;
 obdclass/cl_object.c &lt;br /&gt;
&lt;br /&gt;
Linux kernel implementation is currently in:&lt;br /&gt;
&lt;br /&gt;
 llite/llite_cl.c&lt;br /&gt;
&lt;br /&gt;
==client I/O==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client I/O is a group of interfaces used by various layers of a Lustre client to manage file data (as opposed to metadata). The main functions of these interfaces are to:&lt;br /&gt;
&lt;br /&gt;
* Cache data, respecting limitations imposed both by the hosting MM/VM and by cluster-wide caching policies, and &lt;br /&gt;
* Form a stream of efficient I/O RPCs, respecting both the ordering/timing constraints imposed by the hosting VFS (e.g., POSIX guarantees, O_SYNC, etc.) and cluster-wide I/O scheduling policies. &lt;br /&gt;
&lt;br /&gt;
Client I/O subsystem interacts with VFS, VM/MM, DLM, and PTLRPC.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Client I/O interfaces are based on the following data-types:&lt;br /&gt;
&lt;br /&gt;
* cl_object: represents a file system object, which can be either a file or a stripe; &lt;br /&gt;
* cl_page: represents a cached data page; &lt;br /&gt;
* cl_lock: represents an extent DLM lock; &lt;br /&gt;
* cl_io: represents an ongoing high-level I/O activity, such as a read(2)/write(2) system call, or a sub-I/O of another I/O; &lt;br /&gt;
* cl_req: represents a network RPC. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This describes the &#039;&#039;next generation&#039;&#039; Lustre client I/O code. The code location is not finalized. The generic part is at:&lt;br /&gt;
&lt;br /&gt;
 include/cl_object.h&lt;br /&gt;
 obdclass/cl_object.c&lt;br /&gt;
 obdclass/cl_page.c&lt;br /&gt;
 obdclass/cl_lock.c&lt;br /&gt;
 obdclass/cl_io.c &lt;br /&gt;
&lt;br /&gt;
Layer-specific methods are currently at:&lt;br /&gt;
&lt;br /&gt;
 lustre/LAYER/LAYER_cl.c &lt;br /&gt;
&lt;br /&gt;
where LAYER is one of llite, lov, osc.&lt;br /&gt;
&lt;br /&gt;
==client metadata==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The MetaData Client (MDC) is the client-side interface for all operations related to the MetaData Server (MDS). In current configurations there is a single MDC on the client for each filesystem mounted on the client. The MDC is responsible for enqueueing metadata locks (via LDLM), and for packing and unpacking messages on the wire.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to ensure a recoverable system, the client limits the MDC to a single filesystem-modifying operation in flight at any one time. This includes operations such as create, rename, link, unlink, and setattr.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For non-modifying operations such as getattr and statfs, the client can have multiple RPC requests in flight at one time, limited by a client-side tunable, to avoid overwhelming the MDS. &lt;br /&gt;
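&lt;br /&gt;
The flow-control rules above (one modifying RPC in flight, a bounded number of non-modifying RPCs) can be sketched with two semaphores. This is an illustration only; the class and method names are hypothetical, not actual Lustre code:&lt;br /&gt;
&lt;br /&gt;
```python
import threading

# Illustrative sketch of the MDC flow-control rules described above.
# Class and method names are hypothetical, not actual Lustre code.
class MdcRpcGate:
    def __init__(self, max_readonly_rpcs=8):
        # Exactly one filesystem-modifying request in flight at a time,
        # so recovery only ever has one uncommitted change to replay.
        self._modify_slot = threading.Semaphore(1)
        # Non-modifying requests (getattr, statfs) are bounded by a
        # client-side tunable to avoid overwhelming the MDS.
        self._readonly_slots = threading.Semaphore(max_readonly_rpcs)

    def send_modifying(self, send_fn):
        with self._modify_slot:
            return send_fn()

    def send_readonly(self, send_fn):
        with self._readonly_slots:
            return send_fn()
```
&lt;br /&gt;
The single modifying slot is what makes replay tractable: at most one uncommitted modification per client needs to be reconstructed after a failure.&lt;br /&gt;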
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/mdc/lproc_mdc.c       - /proc interface for stats/tuning&lt;br /&gt;
 lustre/mdc/mdc_internal.h    - Internal header for prototypes/structs&lt;br /&gt;
 lustre/mdc/mdc_lib.c         - Packing of requests to MDS&lt;br /&gt;
 lustre/mdc/mdc_locks.c       - Interface to LDLM and client VFS intents&lt;br /&gt;
 lustre/mdc/mdc_reint.c       - Modifying requests to MDS&lt;br /&gt;
 lustre/mdc/mdc_request.c     - Non-modifying requests to MDS&lt;br /&gt;
&lt;br /&gt;
==client lmv== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
LMV is the module that implements the client-side abstraction device for clustered metadata (CMD). It allows a client to work with many MDSes without any changes to the llite module, which does not even need to know that CMD is in use: llite just translates Linux VFS requests into metadata API calls and forwards them down the stack.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Because LMV needs to know which MDS to contact for any particular operation, it uses two services introduced in the CMD3 timeframe:&lt;br /&gt;
&lt;br /&gt;
* FLD (FID Location Database) - given a FID (or, more precisely, its sequence), looks up the number of the MDS on which that FID is located;&lt;br /&gt;
* SEQ (Client Sequence Manager) - LMV uses this, via its child MDCs, to allocate new sequences and FIDs. &lt;br /&gt;
&lt;br /&gt;
LMV supports split objects: for every split directory, it creates a special in-memory structure containing information about the object&#039;s stripes (MDS number, FID, etc.). All subsequent operations use these structures to determine which MDS should be used for a particular action (create, take a lock, etc.).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lmv/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==lov==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The LOV device presents a single virtual device interface to the upper layers (llite, liblustre, MDS). The LOV code is responsible for splitting requests across the correct OSTs based on the striping information (lsm), and for merging the replies into a single result to pass back to the higher layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It calculates per-object membership and offsets for read/write/truncate based on the virtual file offset passed from the upper layer. It is also responsible for splitting the locking across all servers as needed.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The LOV on the MDS is also involved in object allocation. &lt;br /&gt;
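&lt;br /&gt;
The per-object offset calculation mentioned above is the usual round-robin stripe mapping. A minimal sketch of the arithmetic, with hypothetical names (the real code must also handle different striping patterns and attribute merging):&lt;br /&gt;
&lt;br /&gt;
```python
# Round-robin stripe mapping of the kind LOV performs when translating
# a virtual file offset into a per-object offset. Names are illustrative.
def file_offset_to_object(file_off, stripe_size, stripe_count):
    """Map a virtual file offset to (stripe index, offset within object)."""
    unit = file_off // stripe_size                 # which stripe unit overall
    stripe_idx = unit % stripe_count               # which object holds it
    obj_off = (unit // stripe_count) * stripe_size + file_off % stripe_size
    return stripe_idx, obj_off
```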
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lov/lov_ea.c          - Striping attributes pack/unpack/verify&lt;br /&gt;
 lustre/lov/lov_internal.h    - Header for internal function prototypes/structs&lt;br /&gt;
 lustre/lov/lov_merge.c       - Struct aggregation from many objects&lt;br /&gt;
 lustre/lov/lov_obd.c         - Base LOV device configuration&lt;br /&gt;
 lustre/lov/lov_offset.c      - File offset and object calculations&lt;br /&gt;
 lustre/lov/lov_pack.c        - Pack/unpack of striping attributes&lt;br /&gt;
 lustre/lov/lov_qos.c         - Object allocation for different OST loading&lt;br /&gt;
 lustre/lov/lov_request.c     - Request handling/splitting/merging&lt;br /&gt;
 lustre/lov/lproc_lov.c       - /proc/fs/lustre/lov tunables/statistics&lt;br /&gt;
&lt;br /&gt;
==quota==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Quotas allow a system administrator to limit the maximum amount of disk space a user or group can consume. Quotas are set by root, and can be specified for individual users and/or groups. Quota limits can be set on both blocks and inodes.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Lustre quota enforcement differs from standard Linux quota support in several ways:&lt;br /&gt;
&lt;br /&gt;
* Lustre quotas are administered via the lfs command, whereas standard Linux quotas use the quotactl interface.&lt;br /&gt;
* As Lustre is a distributed filesystem, Lustre quotas are also distributed, in order to limit the impact on performance.&lt;br /&gt;
* Quotas are allocated and consumed in a quantized fashion. &lt;br /&gt;
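&lt;br /&gt;
The quantized allocation in the last point can be sketched as follows: rather than contacting the quota master for every block, space is handed out in chunks (&#039;qunits&#039;). The names and chunk size below are hypothetical, illustrating the idea only:&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of quantized quota allocation: space is handed out in chunks
# ("qunits") rather than per block, so a slave rarely needs to contact
# the master. Names are hypothetical; the real code also adjusts qunit
# sizes dynamically (cf. quota_adjust_qunit.c).
QUNIT = 128  # blocks granted per request; illustrative value

class QuotaMaster:
    def __init__(self, hard_limit):
        self.hard_limit = hard_limit
        self.granted = 0

    def acquire(self, qunit=QUNIT):
        """Grant up to one qunit of blocks; 0 means the limit is reached."""
        remaining = self.hard_limit - self.granted
        grant = min(qunit, max(remaining, 0))
        self.granted += grant
        return grant
```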
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Quota core:&lt;br /&gt;
&lt;br /&gt;
 lustre/quota/quota_adjust_qunit.c&lt;br /&gt;
 lustre/quota/quota_check.c&lt;br /&gt;
 lustre/quota/quotacheck_test.c&lt;br /&gt;
 lustre/quota/quota_context.c&lt;br /&gt;
 lustre/quota/quota_ctl.c&lt;br /&gt;
 lustre/quota/quota_interface.c&lt;br /&gt;
 lustre/quota/quota_internal.h&lt;br /&gt;
 lustre/quota/quota_master.c &lt;br /&gt;
&lt;br /&gt;
Interactions with the underlying ldiskfs filesystem:&lt;br /&gt;
&lt;br /&gt;
 lustre/lvfs/fsfilt_ext3.c&lt;br /&gt;
 lustre/lvfs/lustre_quota_fmt.c&lt;br /&gt;
 lustre/lvfs/lustre_quota_fmt_convert.c &lt;br /&gt;
&lt;br /&gt;
Hooks under:&lt;br /&gt;
&lt;br /&gt;
 lustre/mds&lt;br /&gt;
 lustre/obdfilter &lt;br /&gt;
&lt;br /&gt;
Regression tests:&lt;br /&gt;
&lt;br /&gt;
 lustre/tests/sanity-quota.sh&lt;br /&gt;
&lt;br /&gt;
==security-gss==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Secure PTLRPC (sptlrpc) is a framework inside the PTLRPC layer. It acts on both sides of each PTLRPC connection between two nodes, transforming every RPC message to turn the connection into a secure communication link. By using GSS, sptlrpc is able to support multiple authentication mechanisms; currently, only Kerberos 5 is supported.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Supported security flavors:&lt;br /&gt;
&lt;br /&gt;
* null: no authentication, no data transform, thus no performance overhead; compatible with 1.6;&lt;br /&gt;
* plain: no authentication, simple data transform, minimal performance overhead;&lt;br /&gt;
* krb5x: per-user mutual client-server authentication using Kerberos 5, with data signing or encryption; can have substantial CPU overhead. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ptlrpc/sec*.c&lt;br /&gt;
 lustre/ptlrpc/gss/&lt;br /&gt;
 lustre/utils/gss/&lt;br /&gt;
&lt;br /&gt;
==security-capa==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Capabilities are pieces of data generated by one service (the master service), passed to a client, and presented by the client to another service (the slave service) to authorize an action. This mechanism is independent of the R/W/X permission-based authorization of file operations.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/llite/llite_capa.c&lt;br /&gt;
 lustre/mdt/mdt_capa.c&lt;br /&gt;
 lustre/obdfilter/filter_capa.c&lt;br /&gt;
 lustre/obdclass/capa.c&lt;br /&gt;
 lustre/include/lustre_capa.h&lt;br /&gt;
&lt;br /&gt;
==security-identity== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Lustre identity is a framework covering several aspects of Lustre file operation authorization. Generally, it can be divided into two parts:&lt;br /&gt;
&lt;br /&gt;
* User-identity parsing / upcall / mapping.&lt;br /&gt;
* File operation permission maintenance and checking, including both traditional file-mode-based permissions and ACL-based permissions.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 /llite/llite_rmtacl.c&lt;br /&gt;
 lustre/mdt/mdt_identity.c&lt;br /&gt;
 lustre/mdt/mdt_idmap.c&lt;br /&gt;
 lustre/mdt/mdt_lib.c&lt;br /&gt;
 lustre/obdclass/idmap.c&lt;br /&gt;
 lustre/utils/l_getidentity.c&lt;br /&gt;
 lustre/include/lustre_idmap.h &lt;br /&gt;
&lt;br /&gt;
 lustre/llite/xattr.c&lt;br /&gt;
 lustre/mdt/mdt_xattr.c&lt;br /&gt;
 lustre/cmm/cmm_object.c&lt;br /&gt;
 lustre/cmm/mdc_object.c&lt;br /&gt;
 lustre/mdd/mdd_permission.c&lt;br /&gt;
 lustre/mdd/mdd_object.c&lt;br /&gt;
 lustre/mdd/mdd_dir.c&lt;br /&gt;
 lustre/obdclass/acl.c&lt;br /&gt;
 lustre/include/lustre_eacl.h&lt;br /&gt;
&lt;br /&gt;
==OST== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The OST is a very thin layer of the data server. Its main responsibility is to parse incoming RPCs and translate them into local calls into obdfilter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/ost/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==ldiskfs== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
ldiskfs is a local disk filesystem built on top of ext3. It adds extent support, a multiblock allocator, multi-mount protection, and the iopen feature to ext3.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
There is no ldiskfs source code in the Lustre repositories (only patches). Instead, the ext3 code is copied from your build kernel, the patches are applied, and the whole tree is then renamed to ldiskfs. For details, see ldiskfs/.&lt;br /&gt;
&lt;br /&gt;
==fsfilt== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The fsfilt layer abstracts the backing-filesystem specifics away from the obdfilter and MDS code in Lustre 1.4 and 1.6. This avoids linking the obdfilter and MDS directly against the filesystem module and, in theory, allows different backing filesystems, but in practice this was never implemented. In Lustre 1.8 and later, this code is replaced by the OSD layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There is a core fsfilt module which can auto-load the backing filesystem type based on the type specified during configuration. This loads a filesystem-specific fsfilt_{fstype} module with a set of methods for that filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are a number of different kinds of methods:&lt;br /&gt;
&lt;br /&gt;
* Get/set filesystem label and UUID for identifying the backing filesystem&lt;br /&gt;
* Start, extend, commit compound filesystem transactions to allow multi-file updates to be atomic for recovery&lt;br /&gt;
* Set a journal callback for transaction disk commit (for Lustre recovery)&lt;br /&gt;
* Store attributes in the inode (possibly avoiding side-effects like truncation when setting the inode size to zero)&lt;br /&gt;
* Get/set file attributes (EAs) for storing LOV and OST info (e.g. striping)&lt;br /&gt;
* Perform low-level IO on the file (avoiding cache)&lt;br /&gt;
* Get/set file version (for future recovery mechanisms)&lt;br /&gt;
* Access quota information &lt;br /&gt;
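&lt;br /&gt;
The autoloading dispatch described above amounts to a per-filesystem method table looked up by type name. A minimal sketch, with hypothetical registry and method names:&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the fsfilt dispatch pattern: a core registry maps a backing
# filesystem type to its method table, populated when the matching
# fsfilt module loads. Registry and method names are hypothetical.
_fsfilt_types = {}

def register_fsfilt(fstype, ops):
    """Called by an fsfilt_(fstype) module when it loads."""
    _fsfilt_types[fstype] = ops

def fsfilt_get_ops(fstype):
    """Core entry point used by obdfilter/MDS to get the method table."""
    if fstype not in _fsfilt_types:
        raise LookupError("no fsfilt module registered for " + fstype)
    return _fsfilt_types[fstype]

# An fsfilt_ext3-style module would register methods such as:
register_fsfilt("ext3", {
    "get_label": lambda sb: sb.get("label"),
    "start": lambda sb, op: {"handle": op},  # open a compound transaction
})
```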
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The files used for the fsfilt code reside in:&lt;br /&gt;
&lt;br /&gt;
 lustre/lvfs/fsfilt.c         - Interface used by obdfilter/MDS, module autoloading&lt;br /&gt;
 lustre/lvfs/fsfilt_ext3.c    - Interface to ext3/ldiskfs filesystem&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;fsfilt_ldiskfs.c&#039;&#039; file is auto-generated from &#039;&#039;fsfilt_ext3.c&#039;&#039; in &#039;&#039;lustre/lvfs/autoMakefile.am&#039;&#039; using sed to replace instances of ext3 and EXT3 with ldiskfs, and a few other replacements to avoid symbol clashes.&lt;br /&gt;
&lt;br /&gt;
==ldiskfs OSD==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
ldiskfs-OSD is an implementation of dt_{device,object} interfaces on top of (modified) ldiskfs file-system.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It uses standard ldiskfs/ext3 code to do file I/O.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It supports two types of indices (in the same file system):&lt;br /&gt;
&lt;br /&gt;
* iam-based index: an extension of the ext3 htree directory format with support for more general keys and values, and with relaxed size restrictions, and &lt;br /&gt;
* compatibility index: a usual ldiskfs directory, accessible through dt_index_operations. &lt;br /&gt;
&lt;br /&gt;
ldiskfs-OSD uses a read-write mutex to serialize compound operations. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/include/dt_object.h&lt;br /&gt;
 lustre/osd/osd_internal.h&lt;br /&gt;
 lustre/osd/osd_handler.c&lt;br /&gt;
&lt;br /&gt;
==DMU OSD== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This is another implementation of the OSD API for userspace DMU. It uses DMU&#039;s ZAP for indices.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 dmu-osd/*.[ch] in b_hd_dmu branch&lt;br /&gt;
&lt;br /&gt;
==DMU== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The DMU is one of the layers in Sun&#039;s ZFS filesystem which is responsible for presenting a transactional object store to its consumers. It is used as Lustre&#039;s backend object storage mechanism for the userspace MDSs and OSSs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ZFS community page has a source tour which is useful as an introduction to the several ZFS layers: [http://www.opensolaris.org/os/community/zfs/source/ ZFS source]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are many useful resources in that community page.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For reference, here&#039;s a list of DMU features:&lt;br /&gt;
&lt;br /&gt;
* Atomic transactions&lt;br /&gt;
* End-to-end data and metadata checksumming (currently supports fletcher2, fletcher4 and sha-256)&lt;br /&gt;
* Compression (currently supports lzjb and gzip with compression levels 1..9)&lt;br /&gt;
* Snapshots and clones&lt;br /&gt;
* Variable block sizes (currently supports block sizes from 512 bytes to 128KB)&lt;br /&gt;
* Integrated volume management with support for RAID-1, RAID-Z and RAID-Z2 and striping&lt;br /&gt;
* Metadata and optional data redundancy (ditto blocks) atop the inherent storage pool redundancy for high resilience&lt;br /&gt;
* Self-healing, which works due to checksumming, ditto blocks and pool redundancy&lt;br /&gt;
* Storage devices that act as level-2 caches (designed for flash storage)&lt;br /&gt;
* Hot spares&lt;br /&gt;
* Designed with scalability in mind - supports up to 2^64 bytes per object, 2^48 objects per filesystem, 2^64 filesystems per pool, 2^64 bytes per device, 2^64 devices per pool, ..&lt;br /&gt;
* Very easy to use admin interface (zfs and zpool commands) &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 src/                  - Source code&lt;br /&gt;
 &lt;br /&gt;
 src/cmd/              - ZFS/DMU related programs&lt;br /&gt;
 src/cmd/lzfs/         - lzfs, the filesystem administration utility&lt;br /&gt;
 src/cmd/lzpool/       - lzpool, the pool administration utility&lt;br /&gt;
 src/cmd/lzdb/         - lzdb, the zfs debugger&lt;br /&gt;
 src/cmd/lztest/       - lztest, the DMU test suite&lt;br /&gt;
 src/cmd/lzfsd/        - lzfsd, the ZFS daemon&lt;br /&gt;
 &lt;br /&gt;
 src/lib/              - Libraries&lt;br /&gt;
 src/lib/port/         - Portability layer&lt;br /&gt;
 src/lib/solcompat/    - Solaris -&amp;gt; Linux portability layer (deprecated, use libport instead)&lt;br /&gt;
 src/lib/avl/          - AVL trees, used in many places in the DMU code&lt;br /&gt;
 src/lib/nvpair/       - Name-value pairs, used in many places in the DMU code&lt;br /&gt;
 src/lib/umem/         - Memory management library&lt;br /&gt;
 src/lib/zpool/        - Main ZFS/DMU code&lt;br /&gt;
 src/lib/zfs/          - ZFS library used by the lzfs and lzpool utilities&lt;br /&gt;
 src/lib/zfscommon/    - Common ZFS code between libzpool and libzfs&lt;br /&gt;
 src/lib/ctl/          - Userspace control/management interface&lt;br /&gt;
 src/lib/udmu/         - Lustre uDMU code (thin library around the DMU)&lt;br /&gt;
 &lt;br /&gt;
 src/scons/            - local copy of SCons&lt;br /&gt;
 &lt;br /&gt;
 tests/regression/     - Regression tests&lt;br /&gt;
 &lt;br /&gt;
 misc/                 - miscellaneous files/scripts&lt;br /&gt;
&lt;br /&gt;
==obdfilter==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
obdfilter is a core component of the OST (data server), making the underlying disk filesystem part of the distributed system:&lt;br /&gt;
&lt;br /&gt;
* Maintains cluster-wide coherency for data&lt;br /&gt;
* Maintains space reservation for data in client&#039;s cache (grants)&lt;br /&gt;
* Maintains quota &lt;br /&gt;
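&lt;br /&gt;
The grant mechanism in the second point can be sketched as simple space accounting: the server promises space to a client before the client caches dirty data, so write-back can never exceed what the disk can hold. Names and units below are illustrative, not the actual Lustre code:&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of grant accounting: the server promises ("grants") space to a
# client before the client caches dirty data, so write-back can never
# exceed what the disk can hold. Names and units are illustrative.
class GrantAccount:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # space not yet promised to anyone
        self.granted = {}               # outstanding grant per client

    def grant(self, client, wanted):
        """Give the client up to the requested number of blocks."""
        give = min(wanted, self.free_blocks)
        self.free_blocks -= give
        self.granted[client] = self.granted.get(client, 0) + give
        return give

    def consume(self, client, used):
        """Client wrote previously granted blocks to disk."""
        self.granted[client] -= used
```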
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/obdfilter/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==MDS==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The MDS service in Lustre 1.4 and 1.6 is a monolithic body of code that provides multiple functions related to filesystem metadata. It handles the incoming RPCs and service threads for metadata operations (create, rename, unlink, readdir, etc.), interfaces with the Lustre lock manager ([https://wikis.clusterfs.com/intra/index.php/Ldlm_summary DLM]), and also manages the underlying filesystem (via the [https://wikis.clusterfs.com/intra/index.php/Fsfilt_summary fsfilt] interface).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The MDS is the primary point of access control for clients; it allocates the objects belonging to a file (in conjunction with the [https://wikis.clusterfs.com/intra/index.php/Lov_summary LOV]) and passes that information to clients when they access a file.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The MDS is also ultimately responsible for deleting objects on the OSTs: either it passes the object information to the client that removes the last link or open reference on a file and has that client destroy the objects, or it destroys the objects on the OSTs itself if the client fails to do so.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In the 1.8 and later releases, the functionality provided by the MDS code has been split into multiple parts ([https://wikis.clusterfs.com/intra/index.php/Mdt_summary MDT], [https://wikis.clusterfs.com/intra/index.php/Mdd_summary MDD], OSD) in order to allow stacking of the metadata devices for clustered metadata.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/mds/commit_confd.c&lt;br /&gt;
 lustre/mds/handler.c            - RPC request handler&lt;br /&gt;
 lustre/mds/lproc_mds.c          - /proc interface for stats/control&lt;br /&gt;
 lustre/mds/mds_fs.c             - Mount/configuration of underlying filesystem&lt;br /&gt;
 lustre/mds/mds_internal.h       - Header for internal declarations&lt;br /&gt;
 lustre/mds/mds_join.c           - Handle join_file operations&lt;br /&gt;
 lustre/mds/mds_lib.c            - Unpack of wire structs from requests&lt;br /&gt;
 lustre/mds/mds_log.c            - Lustre log interface (llog) for unlink/setattr&lt;br /&gt;
 lustre/mds/mds_lov.c            - Interface to LOV for create and orphan&lt;br /&gt;
 lustre/mds/mds_open.c           - File open/close handling&lt;br /&gt;
 lustre/mds/mds_reint.c          - Reintegration of changes made by clients&lt;br /&gt;
 lustre/mds/mds_unlink_open.c    - Handling of open-unlinked files (PENDING dir)&lt;br /&gt;
 lustre/mds/mds_xattr.c          - User-extended attribute handling&lt;br /&gt;
&lt;br /&gt;
==MDT==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MDT stands for MetaData Target. This is the topmost layer in the metadata server device stack. The MDT is responsible for all of the networking, as far as metadata are concerned:&lt;br /&gt;
&lt;br /&gt;
* Managing PTLRPC services and threads;&lt;br /&gt;
* Receiving incoming requests, unpacking them and checking their validity;&lt;br /&gt;
* Sending replies;&lt;br /&gt;
* Handling recovery;&lt;br /&gt;
* Using DLM to guarantee cluster-wide meta-data consistency;&lt;br /&gt;
* Handling intents;&lt;br /&gt;
* Handling credential translation. &lt;br /&gt;
&lt;br /&gt;
Theoretically, the MDT is an optional layer: a completely local Lustre setup, with a single metadata server and a locally mounted client, could exist without an MDT (and still use networking for non-metadata access). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/mdt/mdt.mod.c&lt;br /&gt;
 lustre/mdt/mdt_capa.c&lt;br /&gt;
 lustre/mdt/mdt_handler.c&lt;br /&gt;
 lustre/mdt/mdt_identity.c&lt;br /&gt;
 lustre/mdt/mdt_idmap.c&lt;br /&gt;
 lustre/mdt/mdt_internal.h&lt;br /&gt;
 lustre/mdt/mdt_lib.c&lt;br /&gt;
 lustre/mdt/mdt_lproc.c&lt;br /&gt;
 lustre/mdt/mdt_open.c&lt;br /&gt;
 lustre/mdt/mdt_recovery.c&lt;br /&gt;
 lustre/mdt/mdt_reint.c&lt;br /&gt;
 lustre/mdt/mdt_xattr.c&lt;br /&gt;
&lt;br /&gt;
==CMM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM is a new layer in the MDS that handles all clustered metadata issues and relationships. The CMM does the following:&lt;br /&gt;
&lt;br /&gt;
* Acts as layer between the MDT and MDD.&lt;br /&gt;
* Provides MDS-MDS interaction.&lt;br /&gt;
* Queries and updates FLD.&lt;br /&gt;
* Does the local or remote operation if needed.&lt;br /&gt;
* Will do rollback - epoch control, undo logging. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CMM functionality&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM chooses all of the servers involved in an operation and sends dependent requests as needed. Calling a remote MDS is a new feature related to CMD. The CMM maintains a list of MDCs used to connect to all of the other MDSes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Objects&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM can allocate two types of objects - local and remote. Remote objects can occur during metadata operations in which more than one object is involved; such an operation is called a cross-reference operation. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/cmm&lt;br /&gt;
&lt;br /&gt;
==MDD== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MDD is the metadata layer in the new MDS stack, and the only layer that operates on metadata in the MDS. The implementation is similar to the VFS metadata operations, but is based on OSD storage. The MDD API is currently used only in the new MDS stack, called from the CMM layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In theory, MDD should be a purely local metadata layer, but for compatibility with the old MDS stack, and to reuse some MDS code (llog and LOV), an MDS device is created and connected to the MDD. The llog and LOV code in MDD thus still uses the original code through this temporary MDS device, which will be removed when new llog and LOV layers are implemented in the new MDS stack. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/mdd/&lt;br /&gt;
&lt;br /&gt;
==recovery==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client recovery starts when no server reply is received within a given timeout, or when the server tells the client that it is not connected (i.e., the client was evicted on the server earlier for some reason).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Recovery consists of trying to connect to the server and then stepping through several recovery states, during which various client-server data is synchronized: namely, all requests that were already sent to the server but not yet confirmed as received, and DLM locks. Should any problems arise during the recovery process (be it a timeout or the server&#039;s refusal to recognize the client again), recovery is restarted from the very beginning.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
During recovery, new requests are not sent to the server; instead, they are added to a special delayed-request queue, which is then sent once recovery completes successfully.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Replay and Resend&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Clients go through all of the requests in the sending and replay lists and determine the recovery action needed: replay the request, resend the request, or clean up the associated state for committed requests.&lt;br /&gt;
* The client replays requests that were not committed on the server, but for which the client saw a reply from the server before it failed. This allows the server to replay the changes to the persistent store.&lt;br /&gt;
* The client resends requests that were committed on the server but for which the client did not see a reply, perhaps due to a server or network failure that caused the reply to be lost. This allows the server to reconstruct the reply and send it to the client.&lt;br /&gt;
* The client resends requests that the server has not seen at all; these are all requests with a transno higher than both the last_rcvd value from the server and the last_committed transno, and for which the reply-seen flag is not set.&lt;br /&gt;
* The client gets the last_committed transno information from the server and cleans up the state associated with requests that were committed on the server. &lt;br /&gt;
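&lt;br /&gt;
The per-request decision described in the bullets above can be sketched as a simple classification; the field names here are illustrative:&lt;br /&gt;
&lt;br /&gt;
```python
# The per-request recovery decision described above, as a simple
# classification. Field names are illustrative, not actual Lustre code.
def classify_request(transno, reply_seen, last_committed):
    if transno > last_committed and reply_seen:
        return "replay"   # uncommitted, but the client saw a reply
    if transno > last_committed:
        return "resend"   # the server may never have seen it
    if not reply_seen:
        return "resend"   # committed, but the reply was lost
    return "cleanup"      # committed and replied: drop the saved state
```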
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Recovery code is scattered throughout almost all of the code. The most important pieces are:&lt;br /&gt;
 ldlm/ldlm_lib.c    - generic server recovery code &lt;br /&gt;
 ptlrpc/            - client recovery code&lt;br /&gt;
&lt;br /&gt;
==version recovery== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Version Based Recovery&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This recovery technique is based on using versions of objects (inodes) to allow clients to recover outside the ordinary server recovery timeframe.&lt;br /&gt;
&lt;br /&gt;
# The server changes the version of an object during any change and returns that version to the client. The version can be checked during replay to make sure the object is in the same state during replay as it was originally.&lt;br /&gt;
# After a failure, the server starts recovery as usual, but if some client misses the recovery window, the version check will be used for its replays.&lt;br /&gt;
# A client that missed recovery can connect later and try to recover. This is &#039;delayed recovery&#039;, and the version check is always used during it.&lt;br /&gt;
# A client that missed the main recovery window is not evicted and can connect later to initiate recovery. In that case, the versions are checked to determine whether the object was changed by someone else in the meantime.&lt;br /&gt;
# When replay is finished, the client and server check whether any replay failed on any request because of a version mismatch. If not, the client gets a successful reintegration message; if a version mismatch was encountered, the client must be evicted.&lt;br /&gt;
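&lt;br /&gt;
The version check in the steps above can be sketched as follows: the replayed request is assumed to carry the object version the client observed before its change, and a mismatch means another client changed the object in the meantime (names here are hypothetical):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the version check used during (delayed) replay: the request
# carries the object version the client observed, and a mismatch means
# another client changed the object. Names are hypothetical.
def replay_with_version_check(server_versions, obj_id, expected, apply_fn):
    current = server_versions.get(obj_id)
    if current != expected:
        return "version_mismatch"  # replay fails; client must be evicted
    server_versions[obj_id] = apply_fn(current)
    return "replayed"
```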
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Recovery code is scattered throughout the tree; the most important files are:&lt;br /&gt;
 ldlm/ldlm_lib.c    - generic server recovery code &lt;br /&gt;
 ptlrpc/            - client recovery code&lt;br /&gt;
&lt;br /&gt;
==IAM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
IAM stands for &#039;Index Access Module&#039;: it is an extension to the ldiskfs directory code, adding generic indexing capability.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A file system directory can be thought of as an index mapping keys, which are strings (file names), to records, which are integers (inode numbers). IAM removes the limitations on key and record size and format, providing the abstraction of a transactional container that maps arbitrary opaque keys to opaque records.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Implementation notes:&lt;br /&gt;
&lt;br /&gt;
* IAM is implemented as a set of patches to ldiskfs;&lt;br /&gt;
* IAM is an extension of the ldiskfs directory code that uses the htree data structure for scalable indexing;&lt;br /&gt;
* IAM uses fine-grained key-level and node-level locking (pdirops locking, designed and implemented by Alex Tomas);&lt;br /&gt;
* IAM doesn&#039;t assume any internal format for keys. Keys are compared with the memcmp() function (which dictates big-endian order for scalar keys);&lt;br /&gt;
* IAM supports different flavors of containers:&lt;br /&gt;
** lfix: fixed size record and fixed size keys,&lt;br /&gt;
** lvar: variable sized records and keys,&lt;br /&gt;
** htree: compatibility mode, allowing normal htree directory to be accessed as an IAM container; &lt;br /&gt;
* IAM comes with ioctl(2) based user-level interface. &lt;br /&gt;
&lt;br /&gt;
IAM is used by ldiskfs-OSD to implement dt_index_operations interface. &lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6-sles10.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-ops.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-rhel5.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-rhel4.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-vanilla.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-separate.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.9-rhel4.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-sles10.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-common.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-uapi.patch&lt;br /&gt;
&lt;br /&gt;
==SOM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Size-on-MDS (SOM) is a metadata improvement that caches the inode size, blocks, ctime and mtime on the MDS. This attribute caching allows clients to avoid making RPCs to the OSTs to fetch the attributes encoded in the file objects kept on those OSTs, which significantly improves the performance of listing directories.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 llite/llite_close.c       - client-side SOM code&lt;br /&gt;
 liblustre/file.c          - liblustre SOM code&lt;br /&gt;
 mdt/mdt_handler.c         - general handling of SOM-related rpc&lt;br /&gt;
 mdt/mdt_open.c            - MDS side SOM code &lt;br /&gt;
 mdt/mdt_recovery.c        - MDS side SOM recovery code&lt;br /&gt;
 obdfilter/filter_log.c    - OST-side IO epoch llog code&lt;br /&gt;
&lt;br /&gt;
==tests== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;tests&amp;quot; subsystem is a set of scripts and programs used to test the other Lustre subsystems. It contains:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;runtests&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Simple basic regression test&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of regression tests that verify operation under normal operating &lt;br /&gt;
conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;fsx&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
file system exerciser&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanityn&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Tests that verify operations from two clients under normal operating conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;lfsck&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Tests e2fsck and lfsck to detect and fix filesystem corruption&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;liblustre&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Runs a test linked to a liblustre client library&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-single&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify recovery after MDS failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;conf-sanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify the configuration&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;recovery-small&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify RPC replay after communications failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-ost-single&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify recovery after OST failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-dual&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify the recovery from two clients after server failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;insanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of tests that verify the multiple concurrent failure conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanity-quota&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of tests that verify filesystem quotas&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
acceptance-small.sh is a wrapper normally used to run all (or any) of these scripts. In addition, it is used to run the following pre-installed benchmarks:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;dbench&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Dbench benchmark, simulating N clients to produce filesystem load&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;bonnie&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Bonnie++ benchmark for creating, reading, and deleting many small files&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;iozone&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Iozone benchmark for generating and measuring a variety of file operations.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/tests/acl/run&lt;br /&gt;
 lustre/tests/acl/make-tree&lt;br /&gt;
 lustre/tests/acl/README&lt;br /&gt;
 lustre/tests/acl/setfacl.test&lt;br /&gt;
 lustre/tests/acl/getfacl-noacl.test&lt;br /&gt;
 lustre/tests/acl/permissions.test&lt;br /&gt;
 lustre/tests/acl/inheritance.test&lt;br /&gt;
 lustre/tests/acl/misc.test&lt;br /&gt;
 lustre/tests/acl/cp.test&lt;br /&gt;
 lustre/tests/cfg/local.sh&lt;br /&gt;
 lustre/tests/cfg/insanity-local.sh&lt;br /&gt;
 lustre/tests/ll_sparseness_write.c&lt;br /&gt;
 lustre/tests/writeme.c&lt;br /&gt;
 lustre/tests/cobd.sh&lt;br /&gt;
 lustre/tests/test_brw.c&lt;br /&gt;
 lustre/tests/ll_getstripe_info.c&lt;br /&gt;
 lustre/tests/lov-sanity.sh&lt;br /&gt;
 lustre/tests/sleeptest.c&lt;br /&gt;
 lustre/tests/flocks_test.c&lt;br /&gt;
 lustre/tests/getdents.c&lt;br /&gt;
 lustre/tests/ll_dirstripe_verify.c&lt;br /&gt;
 lustre/tests/sanity.sh&lt;br /&gt;
 lustre/tests/multifstat.c&lt;br /&gt;
 lustre/tests/sanityN.sh&lt;br /&gt;
 lustre/tests/liblustre_sanity_uml.sh&lt;br /&gt;
 lustre/tests/fsx.c&lt;br /&gt;
 lustre/tests/small_write.c&lt;br /&gt;
 lustre/tests/socketserver&lt;br /&gt;
 lustre/tests/cmknod.c&lt;br /&gt;
 lustre/tests/README&lt;br /&gt;
 lustre/tests/acceptance-metadata-double.sh&lt;br /&gt;
 lustre/tests/writemany.c&lt;br /&gt;
 lustre/tests/llecho.sh&lt;br /&gt;
 lustre/tests/lfscktest.sh&lt;br /&gt;
 lustre/tests/run-llog.sh&lt;br /&gt;
 lustre/tests/conf-sanity.sh&lt;br /&gt;
 lustre/tests/mmap_sanity.c&lt;br /&gt;
 lustre/tests/write_disjoint.c&lt;br /&gt;
 lustre/tests/ldaptest.c&lt;br /&gt;
 lustre/tests/acceptance-metadata-single.sh&lt;br /&gt;
 lustre/tests/compile.sh&lt;br /&gt;
 lustre/tests/mcreate.c&lt;br /&gt;
 lustre/tests/runas.c&lt;br /&gt;
 lustre/tests/replay-single.sh&lt;br /&gt;
 lustre/tests/lockorder.sh&lt;br /&gt;
 lustre/tests/test2.c&lt;br /&gt;
 lustre/tests/llog-test.sh&lt;br /&gt;
 lustre/tests/fchdir_test.c&lt;br /&gt;
 lustre/tests/mkdirdeep.c&lt;br /&gt;
 lustre/tests/runtests&lt;br /&gt;
 lustre/tests/flock.c&lt;br /&gt;
 lustre/tests/mlink.c&lt;br /&gt;
 lustre/tests/checkstat.c&lt;br /&gt;
 lustre/tests/crash-mod.sh&lt;br /&gt;
 lustre/tests/multiop.c&lt;br /&gt;
 lustre/tests/random-reads.c&lt;br /&gt;
 lustre/tests/disk1_4.zip&lt;br /&gt;
 lustre/tests/rundbench&lt;br /&gt;
 lustre/tests/wantedi.c&lt;br /&gt;
 lustre/tests/rename_many.c&lt;br /&gt;
 lustre/tests/leak_finder.pl&lt;br /&gt;
 lustre/tests/Makefile.am&lt;br /&gt;
 lustre/tests/parallel_grouplock.c&lt;br /&gt;
 lustre/tests/chownmany.c&lt;br /&gt;
 lustre/tests/ost_oos.sh&lt;br /&gt;
 lustre/tests/mkdirmany.c&lt;br /&gt;
 lustre/tests/directio.c&lt;br /&gt;
 lustre/tests/insanity.sh&lt;br /&gt;
 lustre/tests/createmany-mpi.c&lt;br /&gt;
 lustre/tests/createmany.c&lt;br /&gt;
 lustre/tests/runiozone&lt;br /&gt;
 lustre/tests/rmdirmany.c&lt;br /&gt;
 lustre/tests/replay-ost-single.sh&lt;br /&gt;
 lustre/tests/mcr.sh&lt;br /&gt;
 lustre/tests/mrename.c&lt;br /&gt;
 lustre/tests/sanity-quota.sh&lt;br /&gt;
 lustre/tests/lp_utils.c&lt;br /&gt;
 lustre/tests/lp_utils.h&lt;br /&gt;
 lustre/tests/acceptance-metadata-parallel.sh&lt;br /&gt;
 lustre/tests/oos.sh&lt;br /&gt;
 lustre/tests/createdestroy.c&lt;br /&gt;
 lustre/tests/toexcl.c&lt;br /&gt;
 lustre/tests/replay-dual.sh&lt;br /&gt;
 lustre/tests/createtest.c&lt;br /&gt;
 lustre/tests/munlink.c&lt;br /&gt;
 lustre/tests/iopentest1.c&lt;br /&gt;
 lustre/tests/iopentest2.c&lt;br /&gt;
 lustre/tests/openme.c&lt;br /&gt;
 lustre/tests/openclose.c&lt;br /&gt;
 lustre/tests/test-framework.sh&lt;br /&gt;
 lustre/tests/ll_sparseness_verify.c&lt;br /&gt;
 lustre/tests/it_test.c&lt;br /&gt;
 lustre/tests/unlinkmany.c&lt;br /&gt;
 lustre/tests/opendirunlink.c&lt;br /&gt;
 lustre/tests/filter_survey.sh&lt;br /&gt;
 lustre/tests/utime.c&lt;br /&gt;
 lustre/tests/openunlink.c&lt;br /&gt;
 lustre/tests/runvmstat&lt;br /&gt;
 lustre/tests/statmany.c&lt;br /&gt;
 lustre/tests/create.pl&lt;br /&gt;
 lustre/tests/oos2.sh&lt;br /&gt;
 lustre/tests/statone.c&lt;br /&gt;
 lustre/tests/rename.pl&lt;br /&gt;
 lustre/tests/set_dates.sh&lt;br /&gt;
 lustre/tests/openfilleddirunlink.c&lt;br /&gt;
 lustre/tests/openfile.c&lt;br /&gt;
 lustre/tests/llmountcleanup.sh&lt;br /&gt;
 lustre/tests/llmount.sh&lt;br /&gt;
 lustre/tests/acceptance-small.sh&lt;br /&gt;
 lustre/tests/truncate.c&lt;br /&gt;
 lustre/tests/recovery-small.sh&lt;br /&gt;
 lustre/tests/2ost.sh&lt;br /&gt;
 lustre/tests/tchmod.c&lt;br /&gt;
 lustre/tests/socketclient&lt;br /&gt;
 lustre/tests/runobdstat&lt;br /&gt;
 lustre/tests/memhog.c&lt;br /&gt;
 lustre/tests/flock_test.c&lt;br /&gt;
 lustre/tests/busy.sh&lt;br /&gt;
 lustre/tests/write_append_truncate.c&lt;br /&gt;
 lustre/tests/opendevunlink.c&lt;br /&gt;
 lustre/tests/o_directory.c&lt;br /&gt;
&lt;br /&gt;
==build== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The build system is responsible for building Lustre and related components (ldiskfs is normally included in the Lustre tree but can also live completely separately).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The main build process is managed using GNU Autoconf and Automake. Here is a brief outline of how a Lustre binary build from a fresh Git checkout works. User commands are shown in &#039;&#039;&#039;bold&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;sh autogen.sh&#039;&#039;&#039; - autogen performs a few checks and bootstraps the build system using automake and autoconf. It should only need to be called once for a fresh Git clone, but sometimes it needs to be run again. See [https://bugzilla.lustre.org/show_bug.cgi?id=12580 bug 12580].&lt;br /&gt;
**Each component (Lustre and ldiskfs) has an autoMakefile.am in its toplevel directory that sets some variables and includes build/autoMakefile.am.toplevel. It also contains any toplevel autoMake code unique to that component.&lt;br /&gt;
** configure.ac is used by autoconf to generate a configure script. The Lustre configure.ac mostly relies on the macros defined in */autoconf/*.m4 to do its work. The ldiskfs configure.ac is more self-contained and relies only on build/autoconf/*.m4. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;./configure --with-linux=/root/cfs/kernels/linux-2.6.9-55.EL.HEAD&#039;&#039;&#039; - Configure performs extensive checks of the underlying system and kernel, then produces autoMakefiles and Makefiles. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;make&#039;&#039;&#039; - This is where things get really interesting.&lt;br /&gt;
** The @INCLUDE_RULES@ directive in most Makefile.in files includes a whole set of build rules from build/Makefile. See the top of that file for a description of all cases.&lt;br /&gt;
** Normally, it will include autoMakefile, so commands from that file will run.&lt;br /&gt;
** build/autoMakefile.am.toplevel is the basis of the autoMakefile produced in the toplevel directory. It includes the &amp;quot;modules&amp;quot; target.&lt;br /&gt;
** The modules target in turn calls the appropriate Linux make system if we are building on Linux.&lt;br /&gt;
** This build system once again reads the Makefile in each directory, and case 2 from build/Makefile is followed. &lt;br /&gt;
&lt;br /&gt;
So essentially, the Makefile.in controls the kernel build process, and the autoMakefile.am controls the userland build process as well as preparing the sources if necessary.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The build system can also be used to produce Lustre-patched kernels and binaries built against these kernels. The &#039;&#039;&#039;build/lbuild&#039;&#039;&#039; script does this - this is used by customers as well as the LTS. This script is in need of some serious cleanup, unfortunately.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Coding style note: as mentioned in [http://wiki.lustre.org/index.php?title=Coding_Guidelines Coding Guidelines], autoconf macros must follow the style specified in the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html GNU Autoconf manual]. A lot of the older code has inconsistent style and is hard to follow - feel free to reformat when needed. New code &#039;&#039;&#039;must&#039;&#039;&#039; be styled correctly. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Lustre build system:&lt;br /&gt;
&lt;br /&gt;
* build/* (shared with ldiskfs)&lt;br /&gt;
* autogen.sh&lt;br /&gt;
* autoMakefile.am&lt;br /&gt;
* configure.ac&lt;br /&gt;
* lustre.spec.in&lt;br /&gt;
* Makefile.in&lt;br /&gt;
* all autoMakefile.am files&lt;br /&gt;
* all Makefile.in files &lt;br /&gt;
&lt;br /&gt;
ldiskfs build system:&lt;br /&gt;
&lt;br /&gt;
* build/* (shared with Lustre)&lt;br /&gt;
* autogen.sh&lt;br /&gt;
* autoMakefile.am&lt;br /&gt;
* configure.ac&lt;br /&gt;
* lustre-ldiskfs.spec.in&lt;br /&gt;
* Makefile.in&lt;br /&gt;
* all autoMakefile.am files&lt;br /&gt;
* all Makefile.in files&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Subsystem_Map&amp;diff=9066</id>
		<title>Subsystem Map</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Subsystem_Map&amp;diff=9066"/>
		<updated>2009-12-15T22:36:38Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* build */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Lustre subsystems are listed below. For each subsystem, a summary description and code is provided.&lt;br /&gt;
&lt;br /&gt;
==libcfs==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Libcfs provides an API comprising fundamental primitives and subsystems - e.g. process management and debugging support - which is used throughout LNET, Lustre, and associated utilities. This API defines a portable runtime environment that is implemented consistently on all supported build targets.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lnet/libcfs/**/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==lnet==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
See the [http://www.sun.com/software/products/lustre/docs/Lustre-networking.pdf Lustre Networking] white paper for details. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lnet/**/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==ptlrpc==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Ptlrpc implements Lustre communications over LNET.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
All communication between Lustre processes is handled by RPCs, in which a request is sent to an advertised service, and the service processes the request and returns a reply. Note that a service may be offered by any Lustre process - e.g. the OST service on an OSS processes I/O requests, and the AST service on a client processes notifications of lock conflicts.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The initial request message of an RPC is special - it is received into the first available request buffer at the destination. All other communications involved in an RPC are like RDMAs - the peer targets them specifically. For example, in a bulk read, the OSC posts reply and bulk buffers and sends descriptors for them (the LNET matchbits used to post them) in the RPC request. After the server has received the request, it GETs or PUTs the bulk data and PUTs the RPC reply directly.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Ptlrpc ensures all resources involved in an RPC are freed in finite time. If the RPC does not complete within a timeout, all buffers associated with the RPC must be unlinked. These buffers are still accessible to the network until their completion events have been delivered.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ptlrpc/*.[ch]&lt;br /&gt;
 lustre/ldlm/ldlm_lib.c&lt;br /&gt;
&lt;br /&gt;
==llog==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
LLog is the generic logging mechanism in Lustre. It allows Lustre to store records in an appropriate format and access them later using a reasonable API.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
LLog is used in various cases. The main LLog use cases are the following:&lt;br /&gt;
&lt;br /&gt;
* mountconf - entire cluster configuration is stored on the MGS in a special configuration llog. A client may access it via an llog API working over ptlrpc; &lt;br /&gt;
&lt;br /&gt;
* MDS_OST llog - contains records for unlink and setattr operations performed on the MDS in the last, not yet committed, transaction. This is needed to preserve consistency between MDS and OST nodes in failure cases. In general: if the MDS has no inode for some file, then the OST should not have an object for that file either. So, when the OST fails in the middle of an unlink and loses the last transaction containing the unlink of the OST object, the object may be left orphaned on the OST. On the MDS, the transaction with the unlink has completed and the MDS has no inode for the file. This means the file can no longer be accessed and the orphan object just eats up space on the OST. The solution is to maintain the unlink log on the MDS and process it at MDS-OST connect time to make sure the OST has all such objects unlinked; &lt;br /&gt;
&lt;br /&gt;
* Size llog - not yet used, but planned to log object size changes on the OST so that it can later be checked whether the object size is coherent with what the MDS has cached (the SOM case); &lt;br /&gt;
&lt;br /&gt;
* LOVEA llog - joins the file LOV EA merge log. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;General design&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Each llog type has two main parts:&lt;br /&gt;
&lt;br /&gt;
* ORIG llog - &amp;quot;server&amp;quot; part, the site where llog records are stored. It provides an API for local and/or network llog access (read, modify). Examples of ORIG logs: MDS is orig for MDS_OST llog and MGS is orig for config logs; &lt;br /&gt;
&lt;br /&gt;
* REPL llog - &amp;quot;client&amp;quot; part, the site where llog records may be used. Examples of REPL logs: OST is repl for MDS_OST llog and MGC is repl for config logs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 obdclass/llog.c&lt;br /&gt;
 obdclass/llog_cat.c&lt;br /&gt;
 obdclass/llog_lvfs.c&lt;br /&gt;
 obdclass/llog_obd.c&lt;br /&gt;
 obdclass/llog_swab.c&lt;br /&gt;
 obdclass/llog_test.c&lt;br /&gt;
 lov/lov_log.c&lt;br /&gt;
 ptlrpc/llog_client.c&lt;br /&gt;
 ptlrpc/llog_server.c&lt;br /&gt;
 ptlrpc/llog_net.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For more information, see [[Logging API]].&lt;br /&gt;
&lt;br /&gt;
==obdclass==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The obdclass code is generic Lustre configuration and device handling. Different functional parts of the Lustre code are split into obd devices which can be configured and connected in various ways to form a server or client filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Several examples of obd devices include:&lt;br /&gt;
&lt;br /&gt;
* OSC - object storage client (connects over network to OST)&lt;br /&gt;
* OST - object storage target&lt;br /&gt;
* LOV - logical object volume (aggregates multiple OSCs into a single virtual device)&lt;br /&gt;
* MDC - meta data client (connects over network to MDT)&lt;br /&gt;
* MDT - meta data target &lt;br /&gt;
&lt;br /&gt;
The obdclass code provides services used by all Lustre devices for configuration, memory allocation, generic hashing, kernel interface routines, random number generation, etc. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/obdclass/class_hash.c        - scalable hash code for imports&lt;br /&gt;
 lustre/obdclass/class_obd.c         - base device handling code&lt;br /&gt;
 lustre/obdclass/debug.c             - helper routines for dumping data structs&lt;br /&gt;
 lustre/obdclass/genops.c            - device allocation/configuration/connection&lt;br /&gt;
 lustre/obdclass/linux-module.c      - linux kernel module handling&lt;br /&gt;
 lustre/obdclass/linux-obdo.c        - pack/unpack obdo and other IO structs&lt;br /&gt;
 lustre/obdclass/linux-sysctl.c      - /proc/sys configuration parameters &lt;br /&gt;
 lustre/obdclass/lprocfs_status.c    - /proc/fs/lustre configuration/stats, helpers&lt;br /&gt;
 lustre/obdclass/lustre_handles.c    - wire opaque pointer handlers&lt;br /&gt;
 lustre/obdclass/lustre_peer.c       - peer target identification by UUID&lt;br /&gt;
 lustre/obdclass/obd_config.c        - configuration file parsing&lt;br /&gt;
 lustre/obdclass/obd_mount.c         - server filesystem mounting&lt;br /&gt;
 lustre/obdclass/obdo.c              - more obdo handling helpers&lt;br /&gt;
 lustre/obdclass/statfs_pack.c       - statfs helpers for wire pack/unpack&lt;br /&gt;
 lustre/obdclass/uuid.c              - UUID pack/unpack&lt;br /&gt;
 lustre/lvfs/lvfs_common.c           - kernel interface helpers&lt;br /&gt;
 lustre/lvfs/lvfs_darwin.c           - darwin kernel helper routines&lt;br /&gt;
 lustre/lvfs/lvfs_internal.h         - lvfs internal function prototypes&lt;br /&gt;
 lustre/lvfs/lvfs_lib.c              - statistics&lt;br /&gt;
 lustre/lvfs/lvfs_linux.c            - linux kernel helper routines&lt;br /&gt;
 lustre/lvfs/lvfs_userfs.c           - userspace helper routines&lt;br /&gt;
 lustre/lvfs/prng.c                  - long period pseudo-random number generator&lt;br /&gt;
 lustre/lvfs/upcall_cache.c          - supplementary group upcall for MDS&lt;br /&gt;
&lt;br /&gt;
==luclass==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
luclass is a body of data-type definitions and functions implementing support for layered objects - that is, entities where every layer in the Lustre device stack (both data and metadata, on both the client and server side) can maintain its own private state and modify the behavior of the compound object in a systematic way.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Specifically, data types are introduced representing a device type (struct lu_device_type, a layer in the Lustre stack), a device (struct lu_device, a specific instance of the type), and an object (struct lu_object). The following lu_object functionality is implemented by generic code:&lt;br /&gt;
&lt;br /&gt;
* A compound object is uniquely identified by a FID and is stored in a hash table indexed by that FID; &lt;br /&gt;
&lt;br /&gt;
* Objects are kept in an LRU list, and a method is provided to purge the least recently accessed objects in reaction to memory pressure; &lt;br /&gt;
&lt;br /&gt;
* Objects are reference counted, and cached; &lt;br /&gt;
&lt;br /&gt;
* Every object has a list of &#039;&#039;layers&#039;&#039; (also known as slices), where devices can store their private state. Every slice also comes with a pointer to an operations vector, allowing the device to modify the object&#039;s behavior. &lt;br /&gt;
&lt;br /&gt;
In addition to objects and devices, luclass includes lu_context, which is a way to efficiently allocate space, without consuming stack space.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
luclass design is specified in the [http://arch.lustre.org/images/a/aa/Md-api-dld.pdf MD API] DLD.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 include/lu_object.h&lt;br /&gt;
 obdclass/lu_object.c&lt;br /&gt;
&lt;br /&gt;
==ldlm==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Lustre Distributed Lock Manager (LDLM) is the Lustre locking infrastructure; it handles locks between clients and servers as well as locks local to a node. Different kinds of locks are available, with different properties. For historical reasons, ldlm also contains some of the generic connection service code (both server and client).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 interval_tree.c           - This is used by extent locks to maintain interval trees (bug 11300)&lt;br /&gt;
 l_lock.c                  - Resource locking primitives. &lt;br /&gt;
 ldlm_extent.c             - Extents locking code used for locking regions inside objects&lt;br /&gt;
 ldlm_flock.c              - BSD (flock) and POSIX locking lock types&lt;br /&gt;
 ldlm_inodebits.c          - Inodebits locks used for metadata locking&lt;br /&gt;
 ldlm_lib.c                - Target and client connecting/reconnecting/recovery code.&lt;br /&gt;
                             Does not really belong to ldlm, but is historically placed &lt;br /&gt;
                             there. Should be in ptlrpc instead.&lt;br /&gt;
 ldlm_lock.c               - Functions dealing mostly with struct ldlm_lock.&lt;br /&gt;
 ldlm_lockd.c              - Functions that reply to incoming lock-related rpcs&lt;br /&gt;
                             (both on the server (lock enq/cancel/...) and the&lt;br /&gt;
                             client (ast handling)).&lt;br /&gt;
 ldlm_plain.c              - Plain locks, predecessor to inodebits locks; not widely used now.&lt;br /&gt;
 ldlm_pool.c               - Pools of locks, related to dynamic lrus and freeing locks on demand.&lt;br /&gt;
 ldlm_request.c            - Collection of functions to work with locks based handles as opposed &lt;br /&gt;
                             to lock structures themselves.&lt;br /&gt;
 ldlm_resource.c           - Functions operating on namespaces and lock resources.&lt;br /&gt;
 include/lustre_dlm.h      - Important defines and declarations for ldlm.&lt;br /&gt;
&lt;br /&gt;
==fids==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Since Lustre 1.7, a FID is the unique object identifier in the cluster. Its main properties are the following:&lt;br /&gt;
&lt;br /&gt;
* A FID is a unique, never-reused object identifier;&lt;br /&gt;
* A FID is allocated by the client inside a sequence granted by the server;&lt;br /&gt;
* A FID is the basis for the ldlm resource used when issuing ldlm locks; because a FID is unique, it is well suited for this;&lt;br /&gt;
* A FID is the basis for building client-side inode numbers, since the server inode+generation pair can no longer be used (in CMD it is not a unique combination);&lt;br /&gt;
* A FID contains no storage information such as an inode number or generation, and is therefore easy to migrate; &lt;br /&gt;
&lt;br /&gt;
A FID consists of three fields:&lt;br /&gt;
&lt;br /&gt;
* f_seq - sequence number&lt;br /&gt;
* f_oid - object identifier inside sequence&lt;br /&gt;
* f_ver - object version &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 fid/fid_request.c&lt;br /&gt;
 fid/fid_lib.c&lt;br /&gt;
 fld/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==seq==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sequence management is a basic mechanism in the new MDS server, related to managing FIDs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A FID is a unique object identifier in Lustre, starting from version 1.7. All FIDs are organized into sequences; one sequence is a range of FIDs. Sequences are granted/allocated to clients by servers, and FIDs are allocated by clients inside a granted sequence. All FIDs inside one sequence live on the same MDS server and as such form one &amp;quot;migration unit&amp;quot; and one &amp;quot;indexing unit&amp;quot;, meaning that the FLD (FIDs Location Database) indexes them all by sequence and thus has only one mapping entry for all FIDs in a sequence. Please see the section devoted to FIDs for more information on the FLD service and FIDs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A sequence has a limit on the number of FIDs that can be allocated in it. When this limit is reached, a new sequence is allocated. Upon disconnect, the server allocates a new sequence to the client when it comes back; the previously used sequence is abandoned even if it was not exhausted. Sequences are a valuable resource, but in the case of recovery, using a new sequence makes things easier and also allows FIDs and objects to be grouped by working session: new connection, new sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code description&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Server side code is divided into two parts:&lt;br /&gt;
&lt;br /&gt;
* Sequence controller - allocates super-sequences, that is, sequences of sequences, to all servers in the cluster (currently only to MDSes, as only they are aware of the new FIDs). Usually the first MDS in the cluster is the sequence controller. &lt;br /&gt;
&lt;br /&gt;
* Sequence manager - allocates meta-sequences (smaller range of sequences inside a super-sequence) to all clients, using granted super-sequence from the sequence controller. All MDSs in the cluster (all servers in the future) are sequence managers. The first MDS is, simultaneously, a sequence controller and a sequence manager.&lt;br /&gt;
&lt;br /&gt;
The client side code allocates new FIDs from the granted meta-sequence. When the meta-sequence is exhausted, a new one is allocated on the server and sent to the client.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The client code consists of an API for working with both server side parts, not only with the sequence manager: since all servers need to talk to the sequence controller, they also use the client API for this.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One important part of the client API is FID allocation. A new FID is allocated in the currently granted sequence until the sequence is exhausted. &lt;br /&gt;
&lt;br /&gt;
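The client-side allocation loop described above can be sketched like this (hypothetical Python; SEQ_WIDTH and the request callback are stand-ins, not real Lustre symbols):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of client-side FID allocation inside a granted
# sequence; SEQ_WIDTH and request_new_seq are stand-ins, not Lustre symbols.
SEQ_WIDTH = 4  # tiny per-sequence limit, for illustration only

class ClientSeq:
    def __init__(self, request_new_seq):
        self.request_new_seq = request_new_seq  # RPC to the sequence manager
        self.seq = request_new_seq()            # initially granted sequence
        self.next_oid = 1

    def alloc_fid(self):
        if self.next_oid > SEQ_WIDTH:           # sequence exhausted
            self.seq = self.request_new_seq()   # server grants a new one
            self.next_oid = 1
        fid = (self.seq, self.next_oid, 0)      # (f_seq, f_oid, f_ver)
        self.next_oid += 1
        return fid
```
&lt;br /&gt;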
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 fid/fid_handler.c    - server side sequence management code;&lt;br /&gt;
 fid/fid_request.c    - client side sequence management code;&lt;br /&gt;
 fid/fid_lib.c        - fids related miscellaneous stuff.&lt;br /&gt;
&lt;br /&gt;
==mountconf==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MountConf is how servers and clients are set up, started, and configured. A MountConf usage document is [http://wiki.lustre.org/index.php?title=Mount_Conf here].&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The major subsystems are the MGS, MGC, and the userspace tools mount.lustre and mkfs.lustre.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The basic idea is:&lt;br /&gt;
&lt;br /&gt;
# Whenever any Lustre component is mount(2)ed, we start a MGC.&lt;br /&gt;
# This establishes a connection to the MGS and downloads a configuration llog.&lt;br /&gt;
# The MGC passes the configuration log through the parser to set up the other OBDs.&lt;br /&gt;
# The MGC holds a CR configuration lock, which the MGS recalls whenever a live configuration change is made. &lt;br /&gt;
&lt;br /&gt;
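The four steps above can be sketched as follows (hypothetical Python; all class and method names are illustrative, not the actual MGS/MGC interfaces):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of the MountConf flow; class and method names are
# illustrative, not the actual MGS/MGC interfaces.
class Mgc:
    def __init__(self, mgs):
        self.mgs = mgs        # connection to the MGS (step 2)
        self.config = []

    def mount(self, target):
        llog = self.mgs.fetch_config_llog(target)  # download the config llog
        for record in llog:                        # step 3: parse the log
            self.setup_obd(record)
        self.mgs.take_config_lock(self, target)    # step 4: hold the CR lock

    def setup_obd(self, record):
        self.config.append(record)  # stand-in for setting up the other OBDs

    def lock_recalled(self, target):
        # The MGS recalls the CR lock on a live config change;
        # the MGC then re-fetches and replays the configuration log.
        self.config = []
        self.mount(target)
```
&lt;br /&gt;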
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MountConf file areas:&lt;br /&gt;
&lt;br /&gt;
 lustre/mgs/*&lt;br /&gt;
 lustre/mgc/*&lt;br /&gt;
 lustre/obdclass/obd_mount.c&lt;br /&gt;
 lustre/utils/mount_lustre.c&lt;br /&gt;
 lustre/utils/mkfs_lustre.c&lt;br /&gt;
&lt;br /&gt;
==liblustre==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Liblustre is a userspace library, used along with libsysio (developed by Sandia), that allows Lustre usage just by linking (or ld_preload&#039;ing) applications with it. Liblustre does not require any kernel support. It is also used on old Cray XT3 machines (and not so old, in the case of Sandia), where all applications are just linked with the library and loaded into memory as the only code to run. Liblustre does not support async operations of any kind due to a lack of interrupts and other notifiers from lower levels to Lustre. Liblustre includes another set of LNDs that are able to work from userspace.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 dir.c          - Directory operations&lt;br /&gt;
 file.c         - File handling operations (like open)&lt;br /&gt;
 llite_lib.c    - General support (init/cleanup/parse options)&lt;br /&gt;
 lutil.c        - Supplementary code to get IP addresses and init various structures &lt;br /&gt;
                  needed to emulate the normal Linux process from other layers&#039; perspective.&lt;br /&gt;
 namei.c        - Metadata operations code.&lt;br /&gt;
 rw.c           - I/O code, including read/write&lt;br /&gt;
 super.c        - &amp;quot;Superblock&amp;quot; operations - mounting/unmounting, inode operations.&lt;br /&gt;
 tests          - directory with liblustre-specific tests.&lt;br /&gt;
&lt;br /&gt;
==echo client/server==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The echo_client and obdecho are OBD devices which help with testing and performance measurement.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
They were implemented originally for network testing - obdecho can replace obdfilter, and echo_client can exercise any downstream configuration. They are normally used in the following configurations:&lt;br /&gt;
&lt;br /&gt;
* echo_client -&amp;gt; obdfilter. This is used to measure raw backend performance without any network I/O. &lt;br /&gt;
* echo_client -&amp;gt; OSC -&amp;gt; &amp;lt;network&amp;gt; -&amp;gt; OST -&amp;gt; obdecho. This is used to measure network and ptlrpc performance. &lt;br /&gt;
* echo_client -&amp;gt; OSC -&amp;gt; &amp;lt;network&amp;gt; -&amp;gt; OST -&amp;gt; obdfilter. This is used to measure performance available to the Lustre client. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/obdecho/&lt;br /&gt;
&lt;br /&gt;
==client vfs==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The client VFS interface, also called &#039;&#039;&#039;llite&#039;&#039;&#039;, is the bridge between the Linux kernel and the underlying Lustre infrastructure represented by the [https://wikis.clusterfs.com/intra/index.php/Lov_summary LOV], [https://wikis.clusterfs.com/intra/index.php/Client_metadata_summary MDC], and [https://wikis.clusterfs.com/intra/index.php/Ldlm_summary LDLM] subsystems. This includes mounting the client filesystem, handling name lookups, starting file I/O, and handling file permissions.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Linux VFS interface has a lot in common with the liblustre interface, which is used in the Catamount environment; as of yet, the code for these two subsystems is not shared and contains a lot of duplication.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/llite/dcache.c            - Interface with Linux dentry cache/intents&lt;br /&gt;
 lustre/llite/dir.c               - readdir handling, filetype in dir, dir ioctl&lt;br /&gt;
 lustre/llite/file.c              - File handles, file ioctl, DLM extent locks&lt;br /&gt;
 lustre/llite/llite_close.c       - File close for opencache&lt;br /&gt;
 lustre/llite/llite_internal.h    - Llite internal function prototypes, structures&lt;br /&gt;
 lustre/llite/llite_lib.c         - Majority of request handling, client mount&lt;br /&gt;
 lustre/llite/llite_mmap.c        - Memory-mapped I/O&lt;br /&gt;
 lustre/llite/llite_nfs.c         - NFS export from clients&lt;br /&gt;
 lustre/llite/lloop.c             - Loop-like block device export from object&lt;br /&gt;
 lustre/llite/lproc_llite.c       - /proc interface for tunables, statistics&lt;br /&gt;
 lustre/llite/namei.c             - Filename lookup, intent handling&lt;br /&gt;
 lustre/llite/rw24.c              - Linux 2.4 IO handling routines&lt;br /&gt;
 lustre/llite/rw26.c              - Linux 2.6 IO handling routines&lt;br /&gt;
 lustre/llite/rw.c                - Linux generic IO handling routines&lt;br /&gt;
 lustre/llite/statahead.c         - Directory statahead for &amp;quot;ls -l&amp;quot; and &amp;quot;rm -r&amp;quot;&lt;br /&gt;
 lustre/llite/super25.c           - Linux 2.6 VFS file method registration&lt;br /&gt;
 lustre/llite/super.c             - Linux 2.4 VFS file method registration&lt;br /&gt;
 lustre/llite/symlink.c           - Symbolic links&lt;br /&gt;
 lustre/llite/xattr.c             - User-extended attributes&lt;br /&gt;
&lt;br /&gt;
==client vm==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client code interacts with VM/MM subsystems of the host OS kernel to cache data (in the form of pages), and to react to various memory-related events, like memory pressure.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Two key components of this interaction are:&lt;br /&gt;
&lt;br /&gt;
* cfs_page_t data-type representing an MM page. It comes with an interface to map/unmap a page to/from the kernel virtual address space, to access various per-page bits (like &#039;dirty&#039;, &#039;uptodate&#039;, etc.), and to lock/unlock a page. Currently, this data-type closely matches the Linux kernel page. It has to be straightened out, formalized, and expanded to include functionality like querying the total number of pages on a node, etc. &lt;br /&gt;
* MM page operations in cl_page (part of new client I/O interface). &lt;br /&gt;
&lt;br /&gt;
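The cfs_page_t interface described above amounts to roughly the following set of operations (illustrative Python sketch, not the real libcfs API):&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative sketch of the cfs_page_t operations listed above;
# method names are invented, not the real libcfs API.
class CfsPage:
    def __init__(self):
        self.flags = set()     # per-page bits: "dirty", "uptodate", ...
        self.mapped = False
        self.locked = False

    def kmap(self):
        self.mapped = True     # map into kernel virtual address space

    def kunmap(self):
        self.mapped = False

    def set_bit(self, name):
        self.flags.add(name)

    def test_bit(self, name):
        return name in self.flags

    def lock(self):
        assert not self.locked
        self.locked = True

    def unlock(self):
        self.locked = False
```
&lt;br /&gt;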
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This describes the &#039;&#039;next generation&#039;&#039; Lustre client I/O code, which is expected to appear in Lustre 2.0. Code location is not finalized.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
cfs_page_t interface is defined and implemented in:&lt;br /&gt;
&lt;br /&gt;
 lnet/include/libcfs/ARCH/ARCH-mem.h&lt;br /&gt;
 lnet/libcfs/ARCH/ARCH-mem.c &lt;br /&gt;
&lt;br /&gt;
The generic part of cl_page will be located in:&lt;br /&gt;
&lt;br /&gt;
 include/cl_object.h&lt;br /&gt;
 obdclass/cl_page.c&lt;br /&gt;
 obdclass/cl_object.c &lt;br /&gt;
&lt;br /&gt;
Linux kernel implementation is currently in:&lt;br /&gt;
&lt;br /&gt;
 llite/llite_cl.c&lt;br /&gt;
&lt;br /&gt;
==client I/O==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client I/O is a group of interfaces used by various layers of a Lustre client to manage file data (as opposed to metadata). The main functions of these interfaces are to:&lt;br /&gt;
&lt;br /&gt;
* Cache data, respecting limitations imposed both by hosting MM/VM, and by cluster-wide caching policies, and &lt;br /&gt;
* Form a stream of efficient I/O RPCs, respecting both ordering/timing constraints imposed by the hosting VFS (e.g., POSIX guarantees, O_SYNC, etc.), and cluster-wide IO scheduling policies. &lt;br /&gt;
&lt;br /&gt;
Client I/O subsystem interacts with VFS, VM/MM, DLM, and PTLRPC.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Client I/O interfaces are based on the following data-types:&lt;br /&gt;
&lt;br /&gt;
* cl_object: represents a file system object, either a file or a stripe; &lt;br /&gt;
* cl_page: represents a cached data page; &lt;br /&gt;
* cl_lock: represents an extent DLM lock; &lt;br /&gt;
* cl_io: represents an ongoing high-level IO activity, like read(2)/write(2) system call, or sub-io of another IO; &lt;br /&gt;
* cl_req: represents a network RPC. &lt;br /&gt;
&lt;br /&gt;
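As an example of how a cl_io relates to its sub-ios, the sketch below divides one read spanning several stripes into per-stripe pieces (hypothetical Python; the stripe geometry is an assumption made for the example):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of splitting one cl_io into per-stripe sub-ios; the
# stripe geometry below is an assumption for the example.
STRIPE_SIZE = 1048576   # 1 MiB
STRIPE_COUNT = 4

def split_into_sub_ios(offset, length):
    sub_ios = []            # (stripe index, file offset, length) triples
    end = offset + length
    while offset != end:
        stripe_end = (offset // STRIPE_SIZE + 1) * STRIPE_SIZE
        chunk_end = min(stripe_end, end)
        stripe_idx = (offset // STRIPE_SIZE) % STRIPE_COUNT
        sub_ios.append((stripe_idx, offset, chunk_end - offset))
        offset = chunk_end  # continue in the next stripe
    return sub_ios
```
&lt;br /&gt;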
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This describes the &#039;&#039;next generation&#039;&#039; Lustre client I/O code. The code location is not finalized. The generic part is at:&lt;br /&gt;
&lt;br /&gt;
 include/cl_object.h&lt;br /&gt;
 obdclass/cl_object.c&lt;br /&gt;
 obdclass/cl_page.c&lt;br /&gt;
 obdclass/cl_lock.c&lt;br /&gt;
 obdclass/cl_io.c &lt;br /&gt;
&lt;br /&gt;
Layer-specific methods are currently at:&lt;br /&gt;
&lt;br /&gt;
 lustre/LAYER/LAYER_cl.c &lt;br /&gt;
&lt;br /&gt;
where LAYER is one of llite, lov, osc.&lt;br /&gt;
&lt;br /&gt;
==client metadata==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Meta Data Client (MDC) is the client-side interface for all operations related to the Meta Data Server (MDS). In current configurations there is a single MDC on the client for each filesystem mounted on the client. The MDC is responsible for enqueueing metadata locks (via LDLM), and for packing and unpacking messages on the wire.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to ensure a recoverable system, the MDC is limited at the client to only a single filesystem-modifying operation in flight at one time. This includes operations like create, rename, link, unlink, setattr.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For non-modifying operations like getattr and statfs, the client can have multiple RPC requests in flight at one time, limited by a tunable on the client, to avoid overwhelming the MDS. &lt;br /&gt;
&lt;br /&gt;
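The flight limits above can be sketched with two semaphores (hypothetical Python; MAX_RPCS_IN_FLIGHT stands in for the client-side tunable):&lt;br /&gt;
&lt;br /&gt;
```python
import threading

# Hypothetical sketch of the MDC flight limits; MAX_RPCS_IN_FLIGHT stands in
# for the client-side tunable mentioned above.
MAX_RPCS_IN_FLIGHT = 8

modify_slot = threading.Semaphore(1)   # create/rename/link/unlink/setattr
readonly_slots = threading.Semaphore(MAX_RPCS_IN_FLIGHT)  # getattr/statfs

def send_rpc(is_modifying, do_request):
    sem = modify_slot if is_modifying else readonly_slots
    with sem:              # block until a slot is free, release afterwards
        return do_request()
```
&lt;br /&gt;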
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/mdc/lproc_mdc.c       - /proc interface for stats/tuning&lt;br /&gt;
 lustre/mdc/mdc_internal.h    - Internal header for prototypes/structs&lt;br /&gt;
 lustre/mdc/mdc_lib.c         - Packing of requests to MDS&lt;br /&gt;
 lustre/mdc/mdc_locks.c       - Interface to LDLM and client VFS intents&lt;br /&gt;
 lustre/mdc/mdc_reint.c       - Modifying requests to MDS&lt;br /&gt;
 lustre/mdc/mdc_request.c     - Non-modifying requests to MDS&lt;br /&gt;
&lt;br /&gt;
==client lmv== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
LMV is a module which implements the CMD client-side abstraction device. It allows the client to work with many MDSes without any changes in the Llite module, and even without knowing that CMD is supported. Llite just translates Linux VFS requests into metadata API calls and forwards them down the stack.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
As LMV needs to know which MDS to talk to for any particular operation, it uses some new services introduced in CMD3. They are:&lt;br /&gt;
&lt;br /&gt;
* FLD (FIDs Location Database) - given a FID, or rather its sequence, looks up the number of the MDS where this FID is located;&lt;br /&gt;
* SEQ (Client Sequence Manager) - LMV uses this, via its child MDCs, for allocating new sequences and FIDs. &lt;br /&gt;
&lt;br /&gt;
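A minimal sketch of the FLD lookup, assuming the FLD is just a sequence-to-MDS map (illustrative Python; the map contents are invented):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of an FLD lookup; the map contents are invented.
# One entry covers every FID in a sequence, since they all live on one MDS.
fld = {0x200000400: 0, 0x240000401: 1}   # f_seq -> MDS index

def fld_lookup(fid):
    f_seq = fid[0]        # only the sequence matters for location
    return fld[f_seq]
```
&lt;br /&gt;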
LMV supports split objects. This means that for every split directory it creates a special in-memory structure which contains information about the object stripes, including MDS number, FID, etc. All subsequent operations use these structures to determine which MDS should be used for a particular action (create, take lock, etc.).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lmv/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==lov==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The LOV device presents a single virtual device interface to upper layers (llite, liblustre, MDS). The LOV code is responsible for splitting requests across the correct OSTs based on striping information (lsm), and for merging the replies into a single result to pass back to the higher layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It calculates per-object membership and offsets for read/write/truncate based on the virtual file offset passed from the upper layer. It is also responsible for splitting the locking across all servers as needed.&lt;br /&gt;
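&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The offset calculation can be sketched as the usual RAID-0 style mapping (hypothetical Python; this is illustrative, not the actual lov_offset.c code):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of the per-object offset calculation (RAID-0 style);
# this is illustrative, not the actual lov_offset.c code.
def lov_file_to_obd_offset(file_off, stripe_size, stripe_count):
    stripe_no = (file_off // stripe_size) % stripe_count  # which object
    chunk_no = file_off // (stripe_size * stripe_count)   # full rounds done
    obd_off = chunk_no * stripe_size + file_off % stripe_size
    return stripe_no, obd_off  # object index, offset within that object
```
&lt;br /&gt;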
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The LOV on the MDS is also involved in object allocation. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lov/lov_ea.c          - Striping attributes pack/unpack/verify&lt;br /&gt;
 lustre/lov/lov_internal.h    - Header for internal function prototypes/structs&lt;br /&gt;
 lustre/lov/lov_merge.c       - Struct aggregation from many objects&lt;br /&gt;
 lustre/lov/lov_obd.c         - Base LOV device configuration&lt;br /&gt;
 lustre/lov/lov_offset.c      - File offset and object calculations&lt;br /&gt;
 lustre/lov/lov_pack.c        - Pack/unpack of striping attributes&lt;br /&gt;
 lustre/lov/lov_qos.c         - Object allocation for different OST loading&lt;br /&gt;
 lustre/lov/lov_request.c     - Request handling/splitting/merging&lt;br /&gt;
 lustre/lov/lproc_lov.c       - /proc/fs/lustre/lov tunables/statistics&lt;br /&gt;
&lt;br /&gt;
==quota==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Quotas allow a system administrator to limit the maximum amount of disk space a user or group can consume. Quotas are set by root, and can be specified for individual users and/or groups. Quota limits can be set on both blocks and inodes.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Lustre quota enforcement differs from standard Linux quota support in several ways:&lt;br /&gt;
&lt;br /&gt;
* Lustre quotas are administered via the lfs command, whereas standard Linux quota uses the quotactl interface.&lt;br /&gt;
* As Lustre is a distributed filesystem, Lustre quotas are also distributed in order to limit the impact on performance.&lt;br /&gt;
* Quotas are allocated and consumed in a quantized fashion. &lt;br /&gt;
&lt;br /&gt;
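The quantized behaviour can be sketched as a slave acquiring quota from the master in fixed chunks (hypothetical Python; QUNIT and the class names are invented for the example):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of quantized quota: a slave acquires quota from the
# master in fixed chunks (qunits); QUNIT and the class names are invented.
QUNIT = 128   # blocks acquired per request

class QuotaMaster:
    def __init__(self, limit):
        self.limit = limit            # blocks the user may still consume

    def acquire(self, want):
        got = min(want, self.limit)   # hand out at most one qunit
        self.limit -= got
        return got

class QuotaSlave:
    def __init__(self, master):
        self.master = master
        self.local = 0                # granted locally, not yet consumed

    def consume(self, blocks):
        while blocks > self.local:    # one RPC per qunit, not per operation
            got = self.master.acquire(QUNIT)
            if got == 0:
                raise OSError("quota exceeded")
            self.local += got
        self.local -= blocks
```
&lt;br /&gt;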
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Quota core:&lt;br /&gt;
&lt;br /&gt;
 lustre/quota/quota_adjust_qunit.c&lt;br /&gt;
 lustre/quota/quota_check.c&lt;br /&gt;
 lustre/quota/quotacheck_test.c&lt;br /&gt;
 lustre/quota/quota_context.c&lt;br /&gt;
 lustre/quota/quota_ctl.c&lt;br /&gt;
 lustre/quota/quota_interface.c&lt;br /&gt;
 lustre/quota/quota_internal.h&lt;br /&gt;
 lustre/quota/quota_master.c &lt;br /&gt;
&lt;br /&gt;
Interactions with the underlying ldiskfs filesystem:&lt;br /&gt;
&lt;br /&gt;
 lustre/lvfs/fsfilt_ext3.c&lt;br /&gt;
 lustre/lvfs/lustre_quota_fmt.c&lt;br /&gt;
 lustre/lvfs/lustre_quota_fmt_convert.c &lt;br /&gt;
&lt;br /&gt;
Hooks under:&lt;br /&gt;
&lt;br /&gt;
 lustre/mds&lt;br /&gt;
 lustre/obdfilter &lt;br /&gt;
&lt;br /&gt;
Regression tests:&lt;br /&gt;
&lt;br /&gt;
 lustre/tests/sanity-quota.sh&lt;br /&gt;
&lt;br /&gt;
==security-gss==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The secure ptlrpc (sptlrpc) is a framework inside the ptlrpc layer. It acts on both sides of each ptlrpc connection between two nodes, transforming every RPC message to turn the connection into a secure communication link. By using GSS, sptlrpc is able to support multiple authentication mechanisms, but currently only Kerberos 5 is supported.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Supported security flavors:&lt;br /&gt;
&lt;br /&gt;
* null: no authentication, no data transform, thus no performance overhead; compatible with 1.6;&lt;br /&gt;
* plain: no authentication, simple data transform, minimal performance overhead;&lt;br /&gt;
* krb5x: per-user client-server mutual authentication using Kerberos 5; signs or encrypts data; can have substantial CPU overhead. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ptlrpc/sec*.c&lt;br /&gt;
 lustre/ptlrpc/gss/&lt;br /&gt;
 lustre/utils/gss/&lt;br /&gt;
&lt;br /&gt;
==security-capa==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Capabilities are pieces of data generated by one service (the master service), passed to a client, and presented by the client to another service (the slave service) to authorize an action. This is independent of the R/W/X permission based file operation authorization.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/llite/llite_capa.c&lt;br /&gt;
 lustre/mdt/mdt_capa.c&lt;br /&gt;
 lustre/obdfilter/filter_capa.c&lt;br /&gt;
 lustre/obdclass/capa.c&lt;br /&gt;
 lustre/include/lustre_capa.h&lt;br /&gt;
&lt;br /&gt;
==security-identity== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Lustre identity is a miscellaneous framework for Lustre file operation authorization. Generally, it can be divided into two parts:&lt;br /&gt;
&lt;br /&gt;
* User-identity parse / upcall / mapping.&lt;br /&gt;
* File operation permission maintenance and check, including the traditional file mode based permissions and ACL based permissions.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/llite/llite_rmtacl.c&lt;br /&gt;
 lustre/mdt/mdt_identity.c&lt;br /&gt;
 lustre/mdt/mdt_idmap.c&lt;br /&gt;
 lustre/mdt/mdt_lib.c&lt;br /&gt;
 lustre/obdclass/idmap.c&lt;br /&gt;
 lustre/utils/l_getidentity.c&lt;br /&gt;
 lustre/include/lustre_idmap.h &lt;br /&gt;
&lt;br /&gt;
 lustre/llite/xattr.c&lt;br /&gt;
 lustre/mdt/mdt_xattr.c&lt;br /&gt;
 lustre/cmm/cmm_object.c&lt;br /&gt;
 lustre/cmm/mdc_object.c&lt;br /&gt;
 lustre/mdd/mdd_permission.c&lt;br /&gt;
 lustre/mdd/mdd_object.c&lt;br /&gt;
 lustre/mdd/mdd_dir.c&lt;br /&gt;
 lustre/obdclass/acl.c&lt;br /&gt;
 lustre/include/lustre_eacl.h&lt;br /&gt;
&lt;br /&gt;
==OST== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The OST is a very thin layer of the data server. Its main responsibility is to translate RPCs into local calls to obdfilter, i.e. RPC parsing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/ost/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==ldiskfs== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
ldiskfs is a local disk filesystem built on top of ext3. It adds extent support, a multiblock allocator, multi-mount protection, and the iopen feature to ext3.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
There is no ldiskfs source code in the Lustre repositories (only patches). Instead, the ext3 code is copied from your build kernel, the patches are applied, and then the whole thing is renamed to ldiskfs. For details, see ldiskfs/.&lt;br /&gt;
&lt;br /&gt;
==fsfilt== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The fsfilt layer abstracts the backing filesystem specifics away from the obdfilter and mds code in Lustre 1.4 and 1.6. This avoids linking the obdfilter and mds directly against the filesystem module and, in theory, allows different backing filesystems, but in practice this was never implemented. In Lustre 1.8 and later this code is replaced by the OSD layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There is a core fsfilt module which can auto-load the backing filesystem type based on the type specified during configuration. This loads a filesystem-specific fsfilt_{fstype} module with a set of methods for that filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are a number of different kinds of methods:&lt;br /&gt;
&lt;br /&gt;
* Get/set filesystem label and UUID for identifying the backing filesystem&lt;br /&gt;
* Start, extend, commit compound filesystem transactions to allow multi-file updates to be atomic for recovery&lt;br /&gt;
* Set a journal callback for transaction disk commit (for Lustre recovery)&lt;br /&gt;
* Store attributes in the inode (possibly avoiding side-effects like truncation when setting the inode size to zero)&lt;br /&gt;
* Get/set file attributes (EAs) for storing LOV and OST info (e.g. striping)&lt;br /&gt;
* Perform low-level IO on the file (avoiding cache)&lt;br /&gt;
* Get/set file version (for future recovery mechanisms)&lt;br /&gt;
* Access quota information &lt;br /&gt;
&lt;br /&gt;
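The per-filesystem method table can be sketched as a simple registry keyed by fstype (hypothetical Python; the real code autoloads a fsfilt_{fstype} kernel module instead):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of fsfilt dispatch: a table of per-fstype methods.
# The real code autoloads a fsfilt_{fstype} kernel module on first use.
fsfilt_types = {}

def register_fsfilt(fstype, ops):
    fsfilt_types[fstype] = ops    # done by fsfilt_{fstype} at module load

def fsfilt(fstype):
    return fsfilt_types[fstype]   # callers dispatch through this table

# Illustrative method table with one of the operations listed above.
register_fsfilt("ext3", {"get_label": lambda sb: "lustre-ost0"})
```
&lt;br /&gt;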
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The files used for the fsfilt code reside in:&lt;br /&gt;
&lt;br /&gt;
 lustre/lvfs/fsfilt.c         - Interface used by obdfilter/MDS, module autoloading&lt;br /&gt;
 lustre/lvfs/fsfilt_ext3.c    - Interface to ext3/ldiskfs filesystem&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;fsfilt_ldiskfs.c&#039;&#039; file is auto-generated from &#039;&#039;fsfilt_ext3.c&#039;&#039; in &#039;&#039;lustre/lvfs/autoMakefile.am&#039;&#039; using sed to replace instances of ext3 and EXT3 with ldiskfs, and a few other replacements to avoid symbol clashes.&lt;br /&gt;
&lt;br /&gt;
==ldiskfs OSD==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
ldiskfs-OSD is an implementation of the dt_{device,object} interfaces on top of the (modified) ldiskfs filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It uses standard ldiskfs/ext3 code to do file I/O.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It supports 2 types of indices (in the same file system):&lt;br /&gt;
&lt;br /&gt;
* iam-based index: this is an extension of ext3 htree directory format with support for more general keys and values, and with relaxed size restrictions, and &lt;br /&gt;
* compatibility index: this is usual ldiskfs directory, accessible through dt_index_operations. &lt;br /&gt;
&lt;br /&gt;
ldiskfs-OSD uses a read-write mutex to serialize compound operations. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/include/dt_object.h&lt;br /&gt;
 lustre/osd/osd_internal.h&lt;br /&gt;
 lustre/osd/osd_handler.c&lt;br /&gt;
&lt;br /&gt;
==DMU OSD== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This is another implementation of the OSD API for userspace DMU. It uses DMU&#039;s ZAP for indices.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 dmu-osd/*.[ch] in b_hd_dmu branch&lt;br /&gt;
&lt;br /&gt;
==DMU== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The DMU is one of the layers in Sun&#039;s ZFS filesystem which is responsible for presenting a transactional object store to its consumers. It is used as Lustre&#039;s backend object storage mechanism for the userspace MDSs and OSSs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ZFS community page has a source tour which is useful as an introduction to the several ZFS layers: [http://www.opensolaris.org/os/community/zfs/source/ ZFS source]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are many useful resources in that community page.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For reference, here&#039;s a list of DMU features:&lt;br /&gt;
&lt;br /&gt;
* Atomic transactions&lt;br /&gt;
* End-to-end data and metadata checksumming (currently supports fletcher2, fletcher4 and sha-256)&lt;br /&gt;
* Compression (currently supports lzjb and gzip with compression levels 1..9)&lt;br /&gt;
* Snapshots and clones&lt;br /&gt;
* Variable block sizes (currently supports block sizes from 512 bytes to 128KB)&lt;br /&gt;
* Integrated volume management with support for RAID-1, RAID-Z and RAID-Z2 and striping&lt;br /&gt;
* Metadata and optional data redundancy (ditto blocks) atop the inherent storage pool redundancy for high resilience&lt;br /&gt;
* Self-healing, which works due to checksumming, ditto blocks and pool redundancy&lt;br /&gt;
* Storage devices that act as level-2 caches (designed for flash storage)&lt;br /&gt;
* Hot spares&lt;br /&gt;
* Designed with scalability in mind - supports up to 2^64 bytes per object, 2^48 objects per filesystem, 2^64 filesystems per pool, 2^64 bytes per device, 2^64 devices per pool, ..&lt;br /&gt;
* Very easy to use admin interface (zfs and zpool commands) &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 src/                  - Source code&lt;br /&gt;
 &lt;br /&gt;
 src/cmd/              - ZFS/DMU related programs&lt;br /&gt;
 src/cmd/lzfs/         - lzfs, the filesystem administration utility&lt;br /&gt;
 src/cmd/lzpool/       - lzpool, the pool administration utility&lt;br /&gt;
 src/cmd/lzdb/         - lzdb, the zfs debugger&lt;br /&gt;
 src/cmd/lztest/       - lztest, the DMU test suite&lt;br /&gt;
 src/cmd/lzfsd/        - lzfsd, the ZFS daemon&lt;br /&gt;
 &lt;br /&gt;
 src/lib/              - Libraries&lt;br /&gt;
 src/lib/port/         - Portability layer&lt;br /&gt;
 src/lib/solcompat/    - Solaris -&amp;gt; Linux portability layer (deprecated, use libport instead)&lt;br /&gt;
 src/lib/avl/          - AVL trees, used in many places in the DMU code&lt;br /&gt;
 src/lib/nvpair/       - Name-value pairs, used in many places in the DMU code&lt;br /&gt;
 src/lib/umem/         - Memory management library&lt;br /&gt;
 src/lib/zpool/        - Main ZFS/DMU code&lt;br /&gt;
 src/lib/zfs/          - ZFS library used by the lzfs and lzpool utilities&lt;br /&gt;
 src/lib/zfscommon/    - Common ZFS code between libzpool and libzfs&lt;br /&gt;
 src/lib/ctl/          - Userspace control/management interface&lt;br /&gt;
 src/lib/udmu/         - Lustre uDMU code (thin library around the DMU)&lt;br /&gt;
 &lt;br /&gt;
 src/scons/            - local copy of SCons&lt;br /&gt;
 &lt;br /&gt;
 tests/regression/     - Regression tests&lt;br /&gt;
 &lt;br /&gt;
 misc/                 - miscellaneous files/scripts&lt;br /&gt;
&lt;br /&gt;
==obdfilter==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
obdfilter is a core component of the OST (data server), making the underlying disk filesystem part of the distributed system:&lt;br /&gt;
&lt;br /&gt;
* Maintains cluster-wide coherency for data&lt;br /&gt;
* Maintains space reservation for data in client&#039;s cache (grants)&lt;br /&gt;
* Maintains quota &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/obdfilter/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==MDS==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The MDS service in Lustre 1.4 and 1.6 is a monolithic body of code that provides multiple functions related to filesystem metadata. It handles the incoming RPCs and service threads for metadata operations (create, rename, unlink, readdir, etc.), interfaces with the Lustre lock manager ([https://wikis.clusterfs.com/intra/index.php/Ldlm_summary DLM]), and also manages the underlying filesystem (via the [https://wikis.clusterfs.com/intra/index.php/Fsfilt_summary fsfilt] interface).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The MDS is the primary point of access control for clients, allocating the objects belonging to a file (in conjunction with [https://wikis.clusterfs.com/intra/index.php/Lov_summary LOV]) and passing that information to the clients when they access a file.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The MDS is also ultimately responsible for deleting objects on the OSTs, either by passing the object information to the client removing the last link or open reference on a file and having the client destroy them, or by destroying the objects on the OSTs itself in case the client fails to do so.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In the 1.8 and later releases, the functionality provided by the MDS code has been split into multiple parts ([https://wikis.clusterfs.com/intra/index.php/Mdt_summary MDT], [https://wikis.clusterfs.com/intra/index.php/Mdd_summary MDD], OSD) in order to allow stacking of the metadata devices for clustered metadata.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/mds/commit_confd.c&lt;br /&gt;
 lustre/mds/handler.c            - RPC request handler&lt;br /&gt;
 lustre/mds/lproc_mds.c          - /proc interface for stats/control&lt;br /&gt;
 lustre/mds/mds_fs.c             - Mount/configuration of underlying filesystem&lt;br /&gt;
 lustre/mds/mds_internal.h       - Header for internal declarations&lt;br /&gt;
 lustre/mds/mds_join.c           - Handle join_file operations&lt;br /&gt;
 lustre/mds/mds_lib.c            - Unpack of wire structs from requests&lt;br /&gt;
 lustre/mds/mds_log.c            - Lustre log interface (llog) for unlink/setattr&lt;br /&gt;
 lustre/mds/mds_lov.c            - Interface to LOV for create and orphan&lt;br /&gt;
 lustre/mds/mds_open.c           - File open/close handling&lt;br /&gt;
 lustre/mds/mds_reint.c          - Reintegration of changes made by clients&lt;br /&gt;
 lustre/mds/mds_unlink_open.c    - Handling of open-unlinked files (PENDING dir)&lt;br /&gt;
 lustre/mds/mds_xattr.c          - User-extended attribute handling&lt;br /&gt;
&lt;br /&gt;
==MDT==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MDT stands for MetaData Target. This is the top-most layer in the metadata server device stack. The MDT is responsible for all of the networking, as far as metadata is concerned:&lt;br /&gt;
&lt;br /&gt;
* Managing PTLRPC services and threads;&lt;br /&gt;
* Receiving incoming requests, unpacking them and checking their validity;&lt;br /&gt;
* Sending replies;&lt;br /&gt;
* Handling recovery;&lt;br /&gt;
* Using DLM to guarantee cluster-wide meta-data consistency;&lt;br /&gt;
* Handling intents;&lt;br /&gt;
* Handling credential translation. &lt;br /&gt;
&lt;br /&gt;
Theoretically, the MDT is an optional layer: a completely local Lustre setup, with a single metadata server and a locally mounted client, can exist without the MDT (and still use networking for non-metadata access). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/mdt/mdt.mod.c&lt;br /&gt;
 lustre/mdt/mdt_capa.c&lt;br /&gt;
 lustre/mdt/mdt_handler.c&lt;br /&gt;
 lustre/mdt/mdt_identity.c&lt;br /&gt;
 lustre/mdt/mdt_idmap.c&lt;br /&gt;
 lustre/mdt/mdt_internal.h&lt;br /&gt;
 lustre/mdt/mdt_lib.c&lt;br /&gt;
 lustre/mdt/mdt_lproc.c&lt;br /&gt;
 lustre/mdt/mdt_open.c&lt;br /&gt;
 lustre/mdt/mdt_recovery.c&lt;br /&gt;
 lustre/mdt/mdt_reint.c&lt;br /&gt;
 lustre/mdt/mdt_xattr.c&lt;br /&gt;
&lt;br /&gt;
==CMM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM is a new layer in the MDS which handles all clustered metadata issues and relationships. The CMM does the following:&lt;br /&gt;
&lt;br /&gt;
* Acts as layer between the MDT and MDD.&lt;br /&gt;
* Provides MDS-MDS interaction.&lt;br /&gt;
* Queries and updates FLD.&lt;br /&gt;
* Does the local or remote operation if needed.&lt;br /&gt;
* Will do rollback - epoch control, undo logging. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CMM functionality&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM chooses all servers involved in an operation and sends dependent requests if needed. Calling a remote MDS is a new feature related to CMD (clustered metadata). The CMM maintains a list of MDCs used to connect to all of the other MDSes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Objects&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM can allocate two types of objects - local and remote. Remote objects can occur during metadata operations with more than one object involved; such an operation is called a cross-ref operation. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/cmm&lt;br /&gt;
&lt;br /&gt;
==MDD== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MDD is the metadata layer in the new MDS stack; it is the only layer that operates on metadata in the MDS. The implementation is similar to the VFS metadata operations but is based on OSD storage. The MDD API is currently used only in the new MDS stack, called by the CMM layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In theory, MDD should be a purely local metadata layer, but for compatibility with the old MDS stack, and to reuse some MDS code (llog and LOV), an MDS device is created and connected to the MDD. The llog and LOV code in the MDD therefore still works through this temporary MDS device, which will be removed when the new llog and LOV layers in the new MDS stack are implemented. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/mdd/&lt;br /&gt;
&lt;br /&gt;
==recovery==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client recovery starts when no server reply is received within a given timeout, or when the server tells the client that it is not connected (the client was evicted by the server earlier for whatever reason).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Recovery consists of trying to connect to the server and then stepping through several recovery states, during which various client-server state is synchronized, namely all requests that were already sent to the server but not yet confirmed as received, and DLM locks. Should any problem arise during the recovery process (be it a timeout or the server refusing to recognise the client again), recovery is restarted from the very beginning.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
During recovery, new requests are not sent to the server but are added to a special delayed-request queue, which is then sent once recovery completes successfully.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Replay and Resend&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Clients go through all the requests in the sending and replay lists and determine the recovery action needed for each - replay the request, resend the request, or clean up the state associated with committed requests.&lt;br /&gt;
* The client replays requests which were not committed on the server, but for which the client saw a reply from the server before it failed. This allows the server to replay the changes to the persistent store.&lt;br /&gt;
* The client resends requests that were committed on the server but for which the client did not see a reply, perhaps because a server or network failure caused the reply to be lost. This allows the server to reconstruct the reply and send it to the client.&lt;br /&gt;
* The client resends requests that the server has not seen at all; these are all requests with a transid higher than the server&#039;s last_rcvd value and the last_committed transno, and for which the reply-seen flag is not set.&lt;br /&gt;
* The client gets the last_committed transno information from the server and cleans up the state associated with requests that were committed on the server. &lt;br /&gt;
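The per-request decision described above can be sketched as follows (a simplified illustration with hypothetical names; the real ptlrpc code handles many more cases):&lt;br /&gt;

```c
/* Hypothetical sketch of the per-request recovery decision; the names
 * are illustrative, not the actual ptlrpc symbols. */
enum recov_action { RECOV_CLEANUP, RECOV_REPLAY, RECOV_RESEND };

static enum recov_action
recov_action(unsigned long long transno,        /* transid of the request */
             unsigned long long last_committed, /* last transno committed on server */
             int reply_seen)                    /* client saw a reply before failure */
{
        if (transno != 0 && transno <= last_committed)
                /* committed on server: free client state, or resend so the
                 * server can reconstruct the lost reply */
                return reply_seen ? RECOV_CLEANUP : RECOV_RESEND;
        if (reply_seen)
                return RECOV_REPLAY;  /* server lost the change: replay it */
        return RECOV_RESEND;          /* server may never have seen it */
}
```
&lt;br /&gt;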
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Recovery code is scattered throughout almost all of the code. The most important locations are:&lt;br /&gt;
 ldlm/ldlm_lib.c    - generic server recovery code &lt;br /&gt;
 ptlrpc/            - client recovery code&lt;br /&gt;
&lt;br /&gt;
==version recovery== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Version Based Recovery&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This recovery technique uses versions of objects (inodes) to allow clients to recover after the ordinary server recovery window has passed.&lt;br /&gt;
&lt;br /&gt;
# The server changes the version of an object during any modification and returns that version to the client. The version can be checked during replay to be sure that the object is in the same state during replay as it was originally.&lt;br /&gt;
# After a failure, the server starts recovery as usual, but if some client misses the recovery window, the version check will be used for its replays.&lt;br /&gt;
# A client that missed recovery can connect later and try to recover. This is &#039;delayed recovery&#039;, and the version check is always used during it.&lt;br /&gt;
# A client that missed the main recovery window will not be evicted and can connect later to initiate recovery. In that case, the versions will be checked to determine whether the object was changed by someone else.&lt;br /&gt;
# When replay is finished, the client and server check whether any replay failed because of a version mismatch. If not, the client gets a successful reintegration message; if a version mismatch was encountered, the client must be evicted.&lt;br /&gt;
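A minimal sketch of the version mechanism (hypothetical names; the real implementation stores versions in inodes and transaction records):&lt;br /&gt;

```c
/* Illustrative version-based recovery check; names are hypothetical. */
struct vbr_object {
        unsigned long long version;  /* updated by the server on every change */
};

/* Server side: tag the object with the transaction number of the change
 * and return that version to the client in the reply. */
static unsigned long long vbr_update(struct vbr_object *obj,
                                     unsigned long long transno)
{
        obj->version = transno;
        return obj->version;
}

/* Replay: proceed only if the object is still in the state the client
 * saw; a mismatch means someone else changed it, so the replaying
 * client must be evicted. */
static int vbr_replay_allowed(const struct vbr_object *obj,
                              unsigned long long pre_version)
{
        return obj->version == pre_version;
}
```
&lt;br /&gt;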
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Recovery code is scattered throughout almost all of the code. The most important locations are:&lt;br /&gt;
 ldlm/ldlm_lib.c    - generic server recovery code &lt;br /&gt;
 ptlrpc/            - client recovery code&lt;br /&gt;
&lt;br /&gt;
==IAM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
IAM stands for &#039;Index Access Module&#039;: it is an extension to the ldiskfs directory code, adding generic indexing capability.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A file system directory can be thought of as an index mapping keys, which are strings (file names), to records, which are integers (inode numbers). IAM removes the limitations on key and record size and format, providing the abstraction of a transactional container that maps arbitrary opaque keys to opaque records.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Implementation notes:&lt;br /&gt;
&lt;br /&gt;
* IAM is implemented as a set of patches to the ldiskfs;&lt;br /&gt;
* IAM is an extension of ldiskfs directory code that uses htree data-structure for scalable indexing;&lt;br /&gt;
* IAM uses fine-grained key-level and node-level locking (pdirops locking, designed and implemented by Alex Tomas);&lt;br /&gt;
* IAM doesn&#039;t assume any internal format for keys. Keys are compared by the memcmp() function (which dictates big-endian byte order for scalar keys);&lt;br /&gt;
* IAM supports different flavors of containers:&lt;br /&gt;
** lfix: fixed size record and fixed size keys,&lt;br /&gt;
** lvar: variable sized records and keys,&lt;br /&gt;
** htree: compatibility mode, allowing normal htree directory to be accessed as an IAM container; &lt;br /&gt;
* IAM comes with ioctl(2) based user-level interface. &lt;br /&gt;
&lt;br /&gt;
IAM is used by ldiskfs-OSD to implement dt_index_operations interface. &lt;br /&gt;
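Why comparison by memcmp() dictates big-endian scalar keys can be seen with a small sketch (an illustration, not the actual IAM key format):&lt;br /&gt;

```c
#include <stdint.h>

/* Encode a 32-bit scalar key big-endian: byte-wise memcmp() of the
 * encodings then agrees with numeric comparison of the values, which a
 * little-endian encoding would not (e.g. 256 would sort below 1). */
static void iam_key_encode_be32(uint32_t val, unsigned char key[4])
{
        key[0] = (unsigned char)(val >> 24);
        key[1] = (unsigned char)(val >> 16);
        key[2] = (unsigned char)(val >> 8);
        key[3] = (unsigned char)val;
}
```
&lt;br /&gt;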
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6-sles10.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-ops.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-rhel5.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-rhel4.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-vanilla.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-separate.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.9-rhel4.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-sles10.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-common.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-uapi.patch&lt;br /&gt;
&lt;br /&gt;
==SOM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Size-on-MDS (SOM) is a metadata improvement that caches the inode size, blocks, ctime and mtime on the MDS. Such attribute caching allows clients to avoid making RPCs to the OSTs to obtain the attributes encoded in the file objects kept on those OSTs, which results in significantly improved performance when listing directories.&lt;br /&gt;
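The saving can be illustrated by counting the RPCs needed to stat() one file (a back-of-the-envelope sketch; glimpse locking and other details are omitted):&lt;br /&gt;

```c
/* Hypothetical RPC count for stat() of a file striped over
 * 'stripe_count' OSTs: without SOM the client must ask every OST for
 * object size/blocks; with a valid SOM cache the single MDS reply
 * suffices. */
static unsigned int stat_rpcs(int som_valid, unsigned int stripe_count)
{
        return som_valid ? 1 : 1 + stripe_count;
}
```
&lt;br /&gt;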
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 llite/llite_close.c       - client-side SOM code&lt;br /&gt;
 liblustre/file.c          - liblustre SOM code&lt;br /&gt;
 mdt/mdt_handler.c         - general handling of SOM-related rpc&lt;br /&gt;
 mdt/mdt_open.c            - MDS side SOM code &lt;br /&gt;
 mdt/mdt_recovery.c        - MDS side SOM recovery code&lt;br /&gt;
 obdfilter/filter_log.c    - OST side IO epoch llog code&lt;br /&gt;
&lt;br /&gt;
==tests== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;tests&amp;quot; subsystem is a set of scripts and programs used to test the other Lustre subsystems. It contains:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;runtests&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Simple basic regression test&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of regression tests that verify operation under normal operating &lt;br /&gt;
conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;fsx&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
File system exerciser&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanityn&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Tests that verify operations from two clients under normal operating conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;lfsck&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Tests e2fsck and lfsck to detect and fix filesystem corruption&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;liblustre&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Runs a test linked to a liblustre client library&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-single&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify recovery after MDS failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;conf-sanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify the configuration&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;recovery-small&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify RPC replay after communications failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-ost-single&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify recovery after OST failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-dual&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify recovery from two clients after a server failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;insanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of tests that verify multiple concurrent failure conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanity-quota&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of tests that verify filesystem quotas&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The acceptance-small.sh script is a wrapper that is normally used to run all (or any) of these scripts. In addition, it is used to run the following pre-installed benchmarks:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;dbench&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Dbench benchmark for simulating N clients producing filesystem load&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;bonnie&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Bonnie++ benchmark for creating, reading, and deleting many small files&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;iozone&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Iozone benchmark for generating and measuring a variety of file operations.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/tests/acl/run&lt;br /&gt;
 lustre/tests/acl/make-tree&lt;br /&gt;
 lustre/tests/acl/README&lt;br /&gt;
 lustre/tests/acl/setfacl.test&lt;br /&gt;
 lustre/tests/acl/getfacl-noacl.test&lt;br /&gt;
 lustre/tests/acl/permissions.test&lt;br /&gt;
 lustre/tests/acl/inheritance.test&lt;br /&gt;
 lustre/tests/acl/misc.test&lt;br /&gt;
 lustre/tests/acl/cp.test&lt;br /&gt;
 lustre/tests/cfg/local.sh&lt;br /&gt;
 lustre/tests/cfg/insanity-local.sh&lt;br /&gt;
 lustre/tests/ll_sparseness_write.c&lt;br /&gt;
 lustre/tests/writeme.c&lt;br /&gt;
 lustre/tests/cobd.sh&lt;br /&gt;
 lustre/tests/test_brw.c&lt;br /&gt;
 lustre/tests/ll_getstripe_info.c&lt;br /&gt;
 lustre/tests/lov-sanity.sh&lt;br /&gt;
 lustre/tests/sleeptest.c&lt;br /&gt;
 lustre/tests/flocks_test.c&lt;br /&gt;
 lustre/tests/getdents.c&lt;br /&gt;
 lustre/tests/ll_dirstripe_verify.c&lt;br /&gt;
 lustre/tests/sanity.sh&lt;br /&gt;
 lustre/tests/multifstat.c&lt;br /&gt;
 lustre/tests/sanityN.sh&lt;br /&gt;
 lustre/tests/liblustre_sanity_uml.sh&lt;br /&gt;
 lustre/tests/fsx.c&lt;br /&gt;
 lustre/tests/small_write.c&lt;br /&gt;
 lustre/tests/socketserver&lt;br /&gt;
 lustre/tests/cmknod.c&lt;br /&gt;
 lustre/tests/README&lt;br /&gt;
 lustre/tests/acceptance-metadata-double.sh&lt;br /&gt;
 lustre/tests/writemany.c&lt;br /&gt;
 lustre/tests/llecho.sh&lt;br /&gt;
 lustre/tests/lfscktest.sh&lt;br /&gt;
 lustre/tests/run-llog.sh&lt;br /&gt;
 lustre/tests/conf-sanity.sh&lt;br /&gt;
 lustre/tests/mmap_sanity.c&lt;br /&gt;
 lustre/tests/write_disjoint.c&lt;br /&gt;
 lustre/tests/ldaptest.c&lt;br /&gt;
 lustre/tests/acceptance-metadata-single.sh&lt;br /&gt;
 lustre/tests/compile.sh&lt;br /&gt;
 lustre/tests/mcreate.c&lt;br /&gt;
 lustre/tests/runas.c&lt;br /&gt;
 lustre/tests/replay-single.sh&lt;br /&gt;
 lustre/tests/lockorder.sh&lt;br /&gt;
 lustre/tests/test2.c&lt;br /&gt;
 lustre/tests/llog-test.sh&lt;br /&gt;
 lustre/tests/fchdir_test.c&lt;br /&gt;
 lustre/tests/mkdirdeep.c&lt;br /&gt;
 lustre/tests/runtests&lt;br /&gt;
 lustre/tests/flock.c&lt;br /&gt;
 lustre/tests/mlink.c&lt;br /&gt;
 lustre/tests/checkstat.c&lt;br /&gt;
 lustre/tests/crash-mod.sh&lt;br /&gt;
 lustre/tests/multiop.c&lt;br /&gt;
 lustre/tests/random-reads.c&lt;br /&gt;
 lustre/tests/disk1_4.zip&lt;br /&gt;
 lustre/tests/rundbench&lt;br /&gt;
 lustre/tests/wantedi.c&lt;br /&gt;
 lustre/tests/rename_many.c&lt;br /&gt;
 lustre/tests/leak_finder.pl&lt;br /&gt;
 lustre/tests/Makefile.am&lt;br /&gt;
 lustre/tests/parallel_grouplock.c&lt;br /&gt;
 lustre/tests/chownmany.c&lt;br /&gt;
 lustre/tests/ost_oos.sh&lt;br /&gt;
 lustre/tests/mkdirmany.c&lt;br /&gt;
 lustre/tests/directio.c&lt;br /&gt;
 lustre/tests/insanity.sh&lt;br /&gt;
 lustre/tests/createmany-mpi.c&lt;br /&gt;
 lustre/tests/createmany.c&lt;br /&gt;
 lustre/tests/runiozone&lt;br /&gt;
 lustre/tests/rmdirmany.c&lt;br /&gt;
 lustre/tests/replay-ost-single.sh&lt;br /&gt;
 lustre/tests/mcr.sh&lt;br /&gt;
 lustre/tests/mrename.c&lt;br /&gt;
 lustre/tests/sanity-quota.sh&lt;br /&gt;
 lustre/tests/lp_utils.c&lt;br /&gt;
 lustre/tests/lp_utils.h&lt;br /&gt;
 lustre/tests/acceptance-metadata-parallel.sh&lt;br /&gt;
 lustre/tests/oos.sh&lt;br /&gt;
 lustre/tests/createdestroy.c&lt;br /&gt;
 lustre/tests/toexcl.c&lt;br /&gt;
 lustre/tests/replay-dual.sh&lt;br /&gt;
 lustre/tests/createtest.c&lt;br /&gt;
 lustre/tests/munlink.c&lt;br /&gt;
 lustre/tests/iopentest1.c&lt;br /&gt;
 lustre/tests/iopentest2.c&lt;br /&gt;
 lustre/tests/openme.c&lt;br /&gt;
 lustre/tests/openclose.c&lt;br /&gt;
 lustre/tests/test-framework.sh&lt;br /&gt;
 lustre/tests/ll_sparseness_verify.c&lt;br /&gt;
 lustre/tests/it_test.c&lt;br /&gt;
 lustre/tests/unlinkmany.c&lt;br /&gt;
 lustre/tests/opendirunlink.c&lt;br /&gt;
 lustre/tests/filter_survey.sh&lt;br /&gt;
 lustre/tests/utime.c&lt;br /&gt;
 lustre/tests/openunlink.c&lt;br /&gt;
 lustre/tests/runvmstat&lt;br /&gt;
 lustre/tests/statmany.c&lt;br /&gt;
 lustre/tests/create.pl&lt;br /&gt;
 lustre/tests/oos2.sh&lt;br /&gt;
 lustre/tests/statone.c&lt;br /&gt;
 lustre/tests/rename.pl&lt;br /&gt;
 lustre/tests/set_dates.sh&lt;br /&gt;
 lustre/tests/openfilleddirunlink.c&lt;br /&gt;
 lustre/tests/openfile.c&lt;br /&gt;
 lustre/tests/llmountcleanup.sh&lt;br /&gt;
 lustre/tests/llmount.sh&lt;br /&gt;
 lustre/tests/acceptance-small.sh&lt;br /&gt;
 lustre/tests/truncate.c&lt;br /&gt;
 lustre/tests/recovery-small.sh&lt;br /&gt;
 lustre/tests/2ost.sh&lt;br /&gt;
 lustre/tests/tchmod.c&lt;br /&gt;
 lustre/tests/socketclient&lt;br /&gt;
 lustre/tests/runobdstat&lt;br /&gt;
 lustre/tests/memhog.c&lt;br /&gt;
 lustre/tests/flock_test.c&lt;br /&gt;
 lustre/tests/busy.sh&lt;br /&gt;
 lustre/tests/write_append_truncate.c&lt;br /&gt;
 lustre/tests/opendevunlink.c&lt;br /&gt;
 lustre/tests/o_directory.c&lt;br /&gt;
&lt;br /&gt;
==build== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The build system is responsible for building Lustre and related components (ldiskfs is normally included in the Lustre tree but can also live completely separately).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The main build process is managed using GNU Autoconf and Automake. Here is a brief outline of how a Lustre binary build from a fresh Git checkout works. User commands are shown in &#039;&#039;&#039;bold&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;sh autogen.sh&#039;&#039;&#039; - autogen performs a few checks and bootstraps the build system using automake and autoconf. It should only need to be run once for a fresh Git clone, but sometimes it needs to be run again. See [https://bugzilla.lustre.org/show_bug.cgi?id=12580 bug 12580].&lt;br /&gt;
**Each component (Lustre and ldiskfs) has an autoMakefile.am in its toplevel directory that sets some variables and includes build/autoMakefile.am.toplevel. It also contains any toplevel autoMake code unique to that component.&lt;br /&gt;
** configure.ac is used by autoconf to generate a configure script. The Lustre configure.ac mostly relies on the macros defined in */autoconf/*.m4 to do its work. The ldiskfs configure.ac is more self-contained and relies only on build/autoconf/*.m4. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;./configure --with-linux=/root/cfs/kernels/linux-2.6.9-55.EL.HEAD&#039;&#039;&#039; - Configure performs extensive checks of the underlying system and kernel, then produces autoMakefiles and Makefiles. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;make&#039;&#039;&#039; - This is where things get really interesting.&lt;br /&gt;
** The @INCLUDE_RULES@ directive in most Makefile.in files includes a whole set of build rules from build/Makefile. See the top of that file for a description of all cases.&lt;br /&gt;
** Normally, it will include autoMakefile, so commands from that file will run.&lt;br /&gt;
** build/autoMakefile.am.toplevel is the basis of the autoMakefile produced in the toplevel directory. It includes the &amp;quot;modules&amp;quot; target.&lt;br /&gt;
** The modules target in turn calls the appropriate Linux make system if we are building on Linux.&lt;br /&gt;
** This build system once again reads the Makefile in each directory, and case 2 from build/Makefile is followed. &lt;br /&gt;
&lt;br /&gt;
So essentially, the Makefile.in controls the kernel build process, and the autoMakefile.am controls the userland build process as well as preparing the sources if necessary.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The build system can also be used to produce Lustre-patched kernels and binaries built against these kernels. The &#039;&#039;&#039;build/lbuild&#039;&#039;&#039; script does this - this is used by customers as well as the LTS. This script is in need of some serious cleanup, unfortunately.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Coding style note: as mentioned in [http://wiki.lustre.org/index.php?title=Coding_Guidelines Coding Guidelines], autoconf macros must follow the style specified in the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html GNU Autoconf manual]. A lot of the older code has inconsistent style and is hard to follow - feel free to reformat when needed. New code &#039;&#039;&#039;must&#039;&#039;&#039; be styled correctly. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Lustre build system:&lt;br /&gt;
&lt;br /&gt;
* build/* (shared with ldiskfs)&lt;br /&gt;
* autogen.sh&lt;br /&gt;
* autoMakefile.am&lt;br /&gt;
* configure.ac&lt;br /&gt;
* lustre.spec.in&lt;br /&gt;
* Makefile.in&lt;br /&gt;
* all autoMakefile.am files&lt;br /&gt;
* all Makefile.in files &lt;br /&gt;
&lt;br /&gt;
ldiskfs build system:&lt;br /&gt;
&lt;br /&gt;
* build/* (shared with Lustre)&lt;br /&gt;
* autogen.sh&lt;br /&gt;
* autoMakefile.am&lt;br /&gt;
* configure.ac&lt;br /&gt;
* lustre-ldiskfs.spec.in&lt;br /&gt;
* Makefile.in&lt;br /&gt;
* all autoMakefile.am files&lt;br /&gt;
* all Makefile.in files&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Subsystem_Map&amp;diff=9065</id>
		<title>Subsystem Map</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Subsystem_Map&amp;diff=9065"/>
		<updated>2009-12-15T22:34:34Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* ldiskfs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The Lustre subsystems are listed below. For each subsystem, a summary description and code is provided.&lt;br /&gt;
&lt;br /&gt;
==libcfs==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Libcfs provides an API comprising fundamental primitives and subsystems - e.g. process management and debugging support - which is used throughout LNET, Lustre, and associated utilities. This API defines a portable runtime environment that is implemented consistently on all supported build targets.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lnet/libcfs/**/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==lnet==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
See the [http://www.sun.com/software/products/lustre/docs/Lustre-networking.pdf Lustre Networking] white paper for details. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lnet/**/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==ptlrpc==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Ptlrpc implements Lustre communications over LNET.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
All communication between Lustre processes is handled by RPCs, in which a request is sent to an advertised service, and the service processes the request and returns a reply. Note that a service may be offered by any Lustre process - e.g. the OST service on an OSS processes I/O requests and the AST service on a client processes notifications of lock conflicts.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The initial request message of an RPC is special - it is received into the first available request buffer at the destination. All other communications involved in an RPC are like RDMAs - the peer targets them specifically. For example, in a bulk read, the OSC posts reply and bulk buffers and sends descriptors for them (the LNET matchbits used to post them) in the RPC request. After the server has received the request, it GETs or PUTs the bulk data and PUTs the RPC reply directly.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Ptlrpc ensures all resources involved in an RPC are freed in finite time. If the RPC does not complete within a timeout, all buffers associated with the RPC must be unlinked. These buffers are still accessible to the network until their completion events have been delivered.&lt;br /&gt;
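The bulk-read flow described above can be sketched with a toy request descriptor (field names are hypothetical; the real ptlrpc wire structures differ): the client posts its reply and bulk buffers first, then ships the matchbits in the request so the server can target them directly.&lt;br /&gt;

```c
#include <stdint.h>

/* Illustrative bulk-read request descriptor (not the real wire format). */
struct bulk_read_req {
        uint64_t reply_matchbits;  /* server PUTs the RPC reply here */
        uint64_t bulk_matchbits;   /* server PUTs the file data here */
        uint64_t offset;           /* read offset within the object */
        uint32_t count;            /* bytes requested */
};

/* Client side: "post" buffers (i.e. pick matchbits) before sending. */
static struct bulk_read_req bulk_read_prepare(uint64_t next_matchbits,
                                              uint64_t offset, uint32_t count)
{
        struct bulk_read_req req = {
                .reply_matchbits = next_matchbits,
                .bulk_matchbits  = next_matchbits + 1,
                .offset          = offset,
                .count           = count,
        };
        return req;
}
```
&lt;br /&gt;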
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ptlrpc/*.[ch]&lt;br /&gt;
 lustre/ldlm/ldlm_lib.c&lt;br /&gt;
&lt;br /&gt;
==llog==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
LLog is the generic logging mechanism in Lustre. It allows Lustre to store records in an appropriate format and access them later using a reasonable API.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
LLog is used in various cases. The main LLog use cases are the following:&lt;br /&gt;
&lt;br /&gt;
* mountconf - entire cluster configuration is stored on the MGS in a special configuration llog. A client may access it via an llog API working over ptlrpc; &lt;br /&gt;
&lt;br /&gt;
* MDS_OST llog - contains records for unlink and setattr operations performed on the MDS in the last, not yet committed transaction. This is needed to preserve consistency between MDS and OST nodes in failure cases. The general rule is that if the MDS does not have an inode for some file, then the OST should not have an object for that file either. So, when the OST fails in the middle of an unlink and loses the last transaction containing the unlink of the OST object, the object may be left behind on the OST. On the MDS, the transaction with the unlinked object has finished and the MDS has no inode for the file. This means that the object cannot be accessed later and just eats up space on the OST. The solution is to maintain the unlink log on the MDS and process it at MDS-OST connect time to make sure the OST has all such objects unlinked; &lt;br /&gt;
&lt;br /&gt;
* Size llog - this is not yet used, but is planned to log object size changes on the OST so that the MDS can later check whether its size information is coherent with the OST (SOM case); &lt;br /&gt;
&lt;br /&gt;
* LOVEA llog - the LOV EA merge log used by the join_file feature. &lt;br /&gt;
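The MDS_OST unlink-log mechanism above can be sketched as a toy model (hypothetical names; the real llog is a transactional, on-disk catalog):&lt;br /&gt;

```c
#define UNLINK_LOG_MAX 16

/* Toy unlink log: the MDS records each unlinked object id, and the
 * records are cancelled once the OST confirms the objects are gone. */
struct unlink_log {
        unsigned long long ids[UNLINK_LOG_MAX];
        unsigned int count;
};

/* MDS side: record an unlink before the OST confirms the destroy. */
static void unlink_log_add(struct unlink_log *log, unsigned long long objid)
{
        if (log->count < UNLINK_LOG_MAX)
                log->ids[log->count++] = objid;
}

/* At MDS-OST connect time: every logged record is re-sent as a destroy
 * to the OST, then cancelled. Returns how many records were replayed. */
static unsigned int unlink_log_process(struct unlink_log *log)
{
        unsigned int replayed = log->count;

        log->count = 0;  /* records cancelled once the OST confirms */
        return replayed;
}
```
&lt;br /&gt;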
&lt;br /&gt;
&#039;&#039;&#039;General design&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Each llog type has two main parts:&lt;br /&gt;
&lt;br /&gt;
* ORIG llog - &amp;quot;server&amp;quot; part, the site where llog records are stored. It provides an API for local and/or network llog access (read, modify). Examples of ORIG logs: MDS is orig for MDS_OST llog and MGS is orig for config logs; &lt;br /&gt;
&lt;br /&gt;
* REPL llog - &amp;quot;client&amp;quot; part, the site where llog records may be used. Examples of REPL logs: OST is repl for MDS_OST llog and MGC is repl for config logs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 obdclass/llog.c&lt;br /&gt;
 obdclass/llog_cat.c&lt;br /&gt;
 obdclass/llog_lvfs.c&lt;br /&gt;
 obdclass/llog_obd.c&lt;br /&gt;
 obdclass/llog_swab.c&lt;br /&gt;
 obdclass/llog_test.c&lt;br /&gt;
 lov/lov_log.c&lt;br /&gt;
 ptlrpc/llog_client.c&lt;br /&gt;
 ptlrpc/llog_server.c&lt;br /&gt;
 ptlrpc/llog_net.c&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For more information, see [[Logging API]].&lt;br /&gt;
&lt;br /&gt;
==obdclass==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The obdclass code is generic Lustre configuration and device handling. Different functional parts of the Lustre code are split into obd devices which can be configured and connected in various ways to form a server or client filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Several examples of obd devices include:&lt;br /&gt;
&lt;br /&gt;
* OSC - object storage client (connects over network to OST)&lt;br /&gt;
* OST - object storage target&lt;br /&gt;
* LOV - logical object volume (aggregates multiple OSCs into a single virtual device)&lt;br /&gt;
* MDC - meta data client (connects over network to MDT)&lt;br /&gt;
* MDT - meta data target &lt;br /&gt;
&lt;br /&gt;
The obdclass code provides services used by all Lustre devices for configuration, memory allocation, generic hashing, kernel interface routines, random number generation, etc. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/obdclass/class_hash.c        - scalable hash code for imports&lt;br /&gt;
 lustre/obdclass/class_obd.c         - base device handling code&lt;br /&gt;
 lustre/obdclass/debug.c             - helper routines for dumping data structs&lt;br /&gt;
 lustre/obdclass/genops.c            - device allocation/configuration/connection&lt;br /&gt;
 lustre/obdclass/linux-module.c      - linux kernel module handling&lt;br /&gt;
 lustre/obdclass/linux-obdo.c        - pack/unpack obdo and other IO structs&lt;br /&gt;
 lustre/obdclass/linux-sysctl.c      - /proc/sys configuration parameters &lt;br /&gt;
 lustre/obdclass/lprocfs_status.c    - /proc/fs/lustre configuration/stats, helpers&lt;br /&gt;
 lustre/obdclass/lustre_handles.c    - wire opaque pointer handlers&lt;br /&gt;
 lustre/obdclass/lustre_peer.c       - peer target identification by UUID&lt;br /&gt;
 lustre/obdclass/obd_config.c        - configuration file parsing&lt;br /&gt;
 lustre/obdclass/obd_mount.c         - server filesystem mounting&lt;br /&gt;
 lustre/obdclass/obdo.c              - more obdo handling helpers&lt;br /&gt;
 lustre/obdclass/statfs_pack.c       - statfs helpers for wire pack/unpack&lt;br /&gt;
 lustre/obdclass/uuid.c              - UUID pack/unpack&lt;br /&gt;
 lustre/lvfs/lvfs_common.c           - kernel interface helpers&lt;br /&gt;
 lustre/lvfs/lvfs_darwin.c           - darwin kernel helper routines&lt;br /&gt;
 lustre/lvfs/lvfs_internal.h         - lvfs internal function prototypes&lt;br /&gt;
 lustre/lvfs/lvfs_lib.c              - statistics&lt;br /&gt;
 lustre/lvfs/lvfs_linux.c            - linux kernel helper routines&lt;br /&gt;
 lustre/lvfs/lvfs_userfs.c           - userspace helper routines&lt;br /&gt;
 lustre/lvfs/prng.c                  - long period pseudo-random number generator&lt;br /&gt;
 lustre/lvfs/upcall_cache.c          - supplementary group upcall for MDS&lt;br /&gt;
&lt;br /&gt;
==luclass==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
luclass is a body of data-type definitions and functions implementing support for layered objects, that is, entities where every layer in the Lustre device stack (both data and metadata, on both the client and server side) can maintain its own private state and modify the behavior of the compound object in a systematic way.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Specifically, data-types are introduced representing a device type (struct lu_device_type, a layer in the Lustre stack), a device (struct lu_device, a specific instance of the type), and an object (struct lu_object). The following lu_object functionality is implemented by generic code:&lt;br /&gt;
&lt;br /&gt;
* A compound object is uniquely identified by a FID, and is stored in a hash table indexed by that FID; &lt;br /&gt;
&lt;br /&gt;
* Objects are kept in an LRU list, and a method is provided to purge the least recently accessed objects in reaction to memory pressure; &lt;br /&gt;
&lt;br /&gt;
* Objects are reference counted, and cached; &lt;br /&gt;
&lt;br /&gt;
* Every object has a list of &#039;&#039;layers&#039;&#039; (also known as slices), where devices can store their private state. Every slice also comes with a pointer to an operations vector, allowing a device to modify the object&#039;s behavior. &lt;br /&gt;
&lt;br /&gt;
In addition to objects and devices, luclass includes lu_context, which is a way to allocate space efficiently without consuming stack space.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
luclass design is specified in the [http://arch.lustre.org/images/a/aa/Md-api-dld.pdf MD API] DLD.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 include/lu_object.h&lt;br /&gt;
 obdclass/lu_object.c&lt;br /&gt;
&lt;br /&gt;
==ldlm==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Lustre Distributed Lock Manager (LDLM) is the Lustre locking infrastructure; it handles locks between clients and servers as well as locks local to a node. Different kinds of locks are available, with different properties. For historical reasons, ldlm also contains some of the generic connection service code (both server and client).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 interval_tree.c           - This is used by extent locks to maintain interval trees (bug 11300)&lt;br /&gt;
 l_lock.c                  - Resource locking primitives. &lt;br /&gt;
 ldlm_extent.c             - Extents locking code used for locking regions inside objects&lt;br /&gt;
 ldlm_flock.c              - BSD and POSIX file locking (flock) lock types&lt;br /&gt;
 ldlm_inodebits.c          - Inodebits locks used for metadata locking&lt;br /&gt;
 ldlm_lib.c                - Target and client connecting/reconnecting/recovery code.&lt;br /&gt;
                             Does not really belong to ldlm, but is historically placed &lt;br /&gt;
                             there. Should be in ptlrpc instead.&lt;br /&gt;
 ldlm_lock.c               - Functions dealing mostly with struct ldlm_lock&lt;br /&gt;
 ldlm_lockd.c              - Handlers replying to incoming lock-related RPCs, both on &lt;br /&gt;
                             the server (lock enqueue/cancel/...) and on the client &lt;br /&gt;
                             (AST handling).&lt;br /&gt;
 ldlm_plain.c              - Plain locks, predecessor to inodebits locks; not widely used now.&lt;br /&gt;
 ldlm_pool.c               - Pools of locks, related to dynamic lrus and freeing locks on demand.&lt;br /&gt;
 ldlm_request.c            - Collection of functions to work with locks based handles as opposed &lt;br /&gt;
                             to lock structures themselves.&lt;br /&gt;
 ldlm_resource.c           - Functions operating on namespaces and lock resources.&lt;br /&gt;
 include/lustre_dlm.h      - Important defines and declarations for ldlm.&lt;br /&gt;
&lt;br /&gt;
==fids==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The FID is the unique object identifier in a cluster since Lustre 1.7. Its main properties are the following:&lt;br /&gt;
&lt;br /&gt;
* A FID is a unique object identifier and is never reused;&lt;br /&gt;
* A FID is allocated by the client inside a sequence granted by the server;&lt;br /&gt;
* A FID is the basis of the ldlm resource name used for issuing ldlm locks; because a FID is unique, it is well suited for this;&lt;br /&gt;
* A FID is the basis for building client-side inode numbers, since server inode+generation can no longer be used - in CMD that combination is not unique;&lt;br /&gt;
* A FID does not contain storage information such as inode number or generation, and as such is easy to migrate; &lt;br /&gt;
&lt;br /&gt;
A FID consists of three fields:&lt;br /&gt;
&lt;br /&gt;
* f_seq - sequence number&lt;br /&gt;
* f_oid - object identifier inside sequence&lt;br /&gt;
* f_ver - object version &lt;br /&gt;
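As a rough sketch, a FID can be modelled as a small fixed-size structure with these three fields (the field widths match struct lu_fid; the helper functions below are invented for illustration, not the real Lustre helpers):&lt;br /&gt;

```c
#include <stdio.h>

/* Modelled after struct lu_fid: 64-bit sequence, 32-bit oid and version. */
struct lu_fid {
    unsigned long long f_seq;  /* sequence number */
    unsigned int       f_oid;  /* object identifier inside the sequence */
    unsigned int       f_ver;  /* object version */
};

/* Two FIDs name the same object iff all three fields match. */
static int fid_equal(const struct lu_fid *a, const struct lu_fid *b)
{
    return a->f_seq == b->f_seq && a->f_oid == b->f_oid &&
           a->f_ver == b->f_ver;
}

/* Render a FID in a [seq:oid:ver] debug form. */
static int fid_to_str(const struct lu_fid *f, char *buf, int len)
{
    return snprintf(buf, len, "[0x%llx:0x%x:0x%x]",
                    f->f_seq, f->f_oid, f->f_ver);
}
```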
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 fid/fid_request.c&lt;br /&gt;
 fid/fid_lib.c&lt;br /&gt;
 fld/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==seq==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Sequence management is a basic mechanism in the new MDS server, related to managing FIDs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A FID is a unique object identifier in Lustre starting from version 1.7. All FIDs are organized into sequences; one sequence is a range of FIDs. Sequences are granted/allocated to clients by servers, and FIDs are allocated by clients inside a granted sequence. All FIDs inside one sequence live on the same MDS server and as such form one &amp;quot;migration unit&amp;quot; and one &amp;quot;indexing unit&amp;quot;, meaning that the FLD (FID Location Database) indexes them all using one sequence and thus has only one mapping entry for all FIDs in the sequence. Please read the section devoted to FIDs in the root table to find more info on the FLD service and FIDs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A sequence has a limit on the number of FIDs that can be allocated in it. When this limit is reached, a new sequence is allocated. Upon disconnect, the server allocates a new sequence to the client when it comes back; the previously used sequence is abandoned even if it was not exhausted. Sequences are a valuable resource, but in the recovery case, using a new sequence makes things easier and also allows FIDs and objects to be grouped by working session: new connection, new sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code description&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Server side code is divided into two parts:&lt;br /&gt;
&lt;br /&gt;
* Sequence controller - allocates super-sequences, that is, sequences of sequences, to all servers in the cluster (currently only to MDSes, as only they are aware of the new FIDs). Usually the first MDS in the cluster is the sequence controller; &lt;br /&gt;
&lt;br /&gt;
* Sequence manager - allocates meta-sequences (a smaller range of sequences inside a super-sequence) to all clients, using the super-sequence granted by the sequence controller. All MDSes in the cluster (all servers in the future) are sequence managers. The first MDS is, simultaneously, a sequence controller and a sequence manager.&lt;br /&gt;
&lt;br /&gt;
Client-side code allocates new sequences from the granted meta-sequence. When the meta-sequence is exhausted, a new one is allocated on the server and sent to the client.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The client code provides an API for working with both server-side parts, not only with the sequence manager: since all servers need to talk to the sequence controller, they use the client API for this as well.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One important part of the client API is FID allocation. A new FID is allocated in the currently granted sequence until the sequence is exhausted. &lt;br /&gt;
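The allocation scheme above can be sketched as follows (illustrative C only; the names and the tiny SEQ_WIDTH constant are invented, and the real sequence width is far larger):&lt;br /&gt;

```c
#define SEQ_WIDTH 2  /* max FIDs per sequence; real width is much larger */

struct fid { unsigned long long seq; unsigned int oid; };

struct seq_client {
    unsigned long long granted_seq;  /* currently granted sequence */
    unsigned int       next_oid;     /* next object id to hand out */
    unsigned long long next_grant;   /* simulates the server's grant pool */
};

/* Simulated RPC to the sequence manager: grant a fresh sequence. */
static unsigned long long seq_server_grant(struct seq_client *c)
{
    return c->next_grant++;
}

/* Allocate a FID from the granted sequence; when the sequence is
 * exhausted, request a new one from the server and restart oids. */
static struct fid seq_alloc_fid(struct seq_client *c)
{
    if (c->next_oid > SEQ_WIDTH) {
        c->granted_seq = seq_server_grant(c);
        c->next_oid = 1;
    }
    struct fid f = { c->granted_seq, c->next_oid++ };
    return f;
}
```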
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 fid/fid_handler.c    - server side sequence management code;&lt;br /&gt;
 fid/fid_request.c    - client side sequence management code;&lt;br /&gt;
 fid/fid_lib.c        - fids related miscellaneous stuff.&lt;br /&gt;
&lt;br /&gt;
==mountconf==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MountConf is how servers and clients are set up, started, and configured. A MountConf usage document is [http://wiki.lustre.org/index.php?title=Mount_Conf here].&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The major subsystems are the MGS, MGC, and the userspace tools mount.lustre and mkfs.lustre.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The basic idea is:&lt;br /&gt;
&lt;br /&gt;
# Whenever any Lustre component is mount(2)ed, we start a MGC.&lt;br /&gt;
# This establishes a connection to the MGS and downloads a configuration llog.&lt;br /&gt;
# The MGC passes the configuration log through the parser to set up the other OBDs.&lt;br /&gt;
# The MGC holds a CR configuration lock, which the MGS recalls whenever a live configuration change is made. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MountConf file areas:&lt;br /&gt;
&lt;br /&gt;
 lustre/mgs/*&lt;br /&gt;
 lustre/mgc/*&lt;br /&gt;
 lustre/obdclass/obd_mount.c&lt;br /&gt;
 lustre/utils/mount_lustre.c&lt;br /&gt;
 lustre/utils/mkfs_lustre.c&lt;br /&gt;
&lt;br /&gt;
==liblustre==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Liblustre is a userspace library, used along with libsysio (developed by Sandia), that allows Lustre usage just by linking (or ld_preload&#039;ing) applications with it. Liblustre does not require any kernel support. It is also used on old Cray XT3 machines (and not so old, in the case of Sandia), where all applications are just linked with the library and loaded into memory as the only code to run. Liblustre does not support async operations of any kind due to a lack of interrupts and other notifiers from lower levels to Lustre. Liblustre includes another set of LNDs that are able to work from userspace.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 dir.c          - Directory operations&lt;br /&gt;
 file.c         - File handling operations (like open)&lt;br /&gt;
 llite_lib.c    - General support (init/cleanup/option parsing)&lt;br /&gt;
 lutil.c        - Supplementary code to get IP addresses and init various structures &lt;br /&gt;
                  needed to emulate the normal Linux process from other layers&#039; perspective.&lt;br /&gt;
 namei.c        - Metadata operations code.&lt;br /&gt;
 rw.c           - I/O code, including read/write&lt;br /&gt;
 super.c        - &amp;quot;Superblock&amp;quot; operations - mounting/unmounting, inode operations.&lt;br /&gt;
 tests          - directory with liblustre-specific tests.&lt;br /&gt;
&lt;br /&gt;
==echo client/server==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The echo_client and obdecho are OBD devices which help testing and performance measurement.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
They were implemented originally for network testing - obdecho can replace obdfilter and echo_client can exercise any downstream configurations. They are normally used in the following configurations:&lt;br /&gt;
&lt;br /&gt;
* echo_client -&amp;gt; obdfilter. This is used to measure raw backend performance without any network I/O. &lt;br /&gt;
* echo_client -&amp;gt; OSC -&amp;gt; &amp;lt;network&amp;gt; -&amp;gt; OST -&amp;gt; obdecho. This is used to measure network and ptlrpc performance. &lt;br /&gt;
* echo_client -&amp;gt; OSC -&amp;gt; &amp;lt;network&amp;gt; -&amp;gt; OST -&amp;gt; obdfilter. This is used to measure performance available to the Lustre client. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/obdecho/&lt;br /&gt;
&lt;br /&gt;
==client vfs==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The client VFS interface, also called &#039;&#039;&#039;llite&#039;&#039;&#039;, is the bridge between the Linux kernel and the underlying Lustre infrastructure represented by the [https://wikis.clusterfs.com/intra/index.php/Lov_summary LOV], [https://wikis.clusterfs.com/intra/index.php/Client_metadata_summary MDC], and [https://wikis.clusterfs.com/intra/index.php/Ldlm_summary LDLM] subsystems. This includes mounting the client filesystem, handling name lookups, starting file I/O, and handling file permissions.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Linux VFS interface has a lot in common with the liblustre interface, which is used in the Catamount environment; as of yet, the code for these two subsystems is not shared and contains a lot of duplication.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/llite/dcache.c            - Interface with Linux dentry cache/intents&lt;br /&gt;
 lustre/llite/dir.c               - readdir handling, filetype in dir, dir ioctl&lt;br /&gt;
 lustre/llite/file.c              - File handles, file ioctl, DLM extent locks&lt;br /&gt;
 lustre/llite/llite_close.c       - File close for opencache&lt;br /&gt;
 lustre/llite/llite_internal.h    - Llite internal function prototypes, structures&lt;br /&gt;
 lustre/llite/llite_lib.c         - Majority of request handling, client mount&lt;br /&gt;
 lustre/llite/llite_mmap.c        - Memory-mapped I/O&lt;br /&gt;
 lustre/llite/llite_nfs.c         - NFS export from clients&lt;br /&gt;
 lustre/llite/lloop.c             - Loop-like block device export from object&lt;br /&gt;
 lustre/llite/lproc_llite.c       - /proc interface for tunables, statistics&lt;br /&gt;
 lustre/llite/namei.c             - Filename lookup, intent handling&lt;br /&gt;
 lustre/llite/rw24.c              - Linux 2.4 IO handling routines&lt;br /&gt;
 lustre/llite/rw26.c              - Linux 2.6 IO handling routines&lt;br /&gt;
 lustre/llite/rw.c                - Linux generic IO handling routines&lt;br /&gt;
 lustre/llite/statahead.c         - Directory statahead for &amp;quot;ls -l&amp;quot; and &amp;quot;rm -r&amp;quot;&lt;br /&gt;
 lustre/llite/super25.c           - Linux 2.6 VFS file method registration&lt;br /&gt;
 lustre/llite/super.c             - Linux 2.4 VFS file method registration&lt;br /&gt;
 lustre/llite/symlink.c           - Symbolic links&lt;br /&gt;
 lustre/llite/xattr.c             - User-extended attributes&lt;br /&gt;
&lt;br /&gt;
==client vm==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client code interacts with VM/MM subsystems of the host OS kernel to cache data (in the form of pages), and to react to various memory-related events, like memory pressure.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Two key components of this interaction are:&lt;br /&gt;
&lt;br /&gt;
* cfs_page_t - a data-type representing an MM page. It comes together with an interface to map/unmap a page to/from the kernel virtual address space, to access various per-page bits, like &#039;dirty&#039;, &#039;uptodate&#039;, etc., and to lock/unlock a page. Currently, this data-type closely matches the Linux kernel page. It has to be straightened out, formalized, and expanded to include functionality like querying the total number of pages on a node, etc. &lt;br /&gt;
* MM page operations in cl_page (part of new client I/O interface). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This describes the &#039;&#039;next generation&#039;&#039; Lustre client I/O code, which is expected to appear in Lustre 2.0. Code location is not finalized.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
cfs_page_t interface is defined and implemented in:&lt;br /&gt;
&lt;br /&gt;
 lnet/include/libcfs/ARCH/ARCH-mem.h&lt;br /&gt;
 lnet/libcfs/ARCH/ARCH-mem.c &lt;br /&gt;
&lt;br /&gt;
Generic part of cl-page will be located in:&lt;br /&gt;
&lt;br /&gt;
 include/cl_object.h&lt;br /&gt;
 obdclass/cl_page.c&lt;br /&gt;
 obdclass/cl_object.c &lt;br /&gt;
&lt;br /&gt;
Linux kernel implementation is currently in:&lt;br /&gt;
&lt;br /&gt;
 llite/llite_cl.c&lt;br /&gt;
&lt;br /&gt;
==client I/O==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client I/O is a group of interfaces used by various layers of a Lustre client to manage file data (as opposed to metadata). Main functions of these interfaces are:&lt;br /&gt;
&lt;br /&gt;
* Cache data, respecting limitations imposed both by hosting MM/VM, and by cluster-wide caching policies, and &lt;br /&gt;
* Form a stream of efficient I/O RPCs, respecting both ordering/timing constraints imposed by the hosting VFS (e.g., POSIX guarantees, O_SYNC, etc.), and cluster-wide IO scheduling policies. &lt;br /&gt;
&lt;br /&gt;
Client I/O subsystem interacts with VFS, VM/MM, DLM, and PTLRPC.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Client I/O interfaces are based on the following data-types:&lt;br /&gt;
&lt;br /&gt;
* cl_object: represents a file system object, both a file, and a stripe; &lt;br /&gt;
* cl_page: represents a cached data page; &lt;br /&gt;
* cl_lock: represents an extent DLM lock; &lt;br /&gt;
* cl_io: represents an ongoing high-level IO activity, like read(2)/write(2) system call, or sub-io of another IO; &lt;br /&gt;
* cl_req: represents a network RPC. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This describes the &#039;&#039;next generation&#039;&#039; Lustre client I/O code. The code location is not finalized. The generic part is at:&lt;br /&gt;
&lt;br /&gt;
 include/cl_object.h&lt;br /&gt;
 obdclass/cl_object.c&lt;br /&gt;
 obdclass/cl_page.c&lt;br /&gt;
 obdclass/cl_lock.c&lt;br /&gt;
 obdclass/cl_io.c &lt;br /&gt;
&lt;br /&gt;
Layer-specific methods are currently at:&lt;br /&gt;
&lt;br /&gt;
 lustre/LAYER/LAYER_cl.c &lt;br /&gt;
&lt;br /&gt;
where LAYER is one of llite, lov, osc.&lt;br /&gt;
&lt;br /&gt;
==client metadata==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The Meta Data Client (MDC) is the client-side interface for all operations related to the Meta Data Server (MDS). In current configurations there is a single MDC on the client for each filesystem mounted on the client. The MDC is responsible for enqueueing metadata locks (via LDLM), and for packing and unpacking messages on the wire.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In order to ensure a recoverable system, the MDC is limited at the client to only a single filesystem-modifying operation in flight at one time. This includes operations like create, rename, link, unlink, setattr.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For non-modifying operations like getattr and statfs, the client can have multiple RPC requests in flight at one time, limited by a tunable on the client, to avoid overwhelming the MDS. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/mdc/lproc_mdc.c       - /proc interface for stats/tuning&lt;br /&gt;
 lustre/mdc/mdc_internal.h    - Internal header for prototypes/structs&lt;br /&gt;
 lustre/mdc/mdc_lib.c         - Packing of requests to MDS&lt;br /&gt;
 lustre/mdc/mdc_locks.c       - Interface to LDLM and client VFS intents&lt;br /&gt;
 lustre/mdc/mdc_reint.c       - Modifying requests to MDS&lt;br /&gt;
 lustre/mdc/mdc_request.c     - Non-modifying requests to MDS&lt;br /&gt;
&lt;br /&gt;
==client lmv== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
LMV is the module that implements the CMD client-side abstraction device. It allows a client to work with many MDSes without any changes in the llite module, and even without knowing that CMD is supported. Llite just translates Linux VFS requests into metadata API calls and forwards them down the stack.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
As LMV needs to know which MDS to talk to for any particular operation, it uses some new services introduced in CMD3 times. They are:&lt;br /&gt;
&lt;br /&gt;
* FLD (FID Location Database) - given a FID, or rather its sequence, looks up the number of the MDS where this FID is located;&lt;br /&gt;
* SEQ (Client Sequence Manager) - LMV uses this, via its child MDCs, for allocating new sequences and FIDs. &lt;br /&gt;
&lt;br /&gt;
LMV supports split objects. This means that for every split directory it creates a special in-memory structure containing information about the object stripes, including the MDS number, FID, etc. All subsequent operations use these structures to determine which MDS should be used for a particular action (create, take a lock, etc.).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lmv/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==lov==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The LOV device presents a single virtual device interface to upper layers (llite, liblustre, MDS). The LOV code is responsible for splitting requests across the correct OSTs based on striping information (lsm), and for merging the replies into a single result to pass back to the higher layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It calculates per-object membership and offsets for read/write/truncate based on the virtual file offset passed from the upper layer. It is also responsible for splitting the locking across all servers as needed.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The LOV on the MDS is also involved in object allocation. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/lov/lov_ea.c          - Striping attributes pack/unpack/verify&lt;br /&gt;
 lustre/lov/lov_internal.h    - Header for internal function prototypes/structs&lt;br /&gt;
 lustre/lov/lov_merge.c       - Struct aggregation from many objects&lt;br /&gt;
 lustre/lov/lov_obd.c         - Base LOV device configuration&lt;br /&gt;
 lustre/lov/lov_offset.c      - File offset and object calculations&lt;br /&gt;
 lustre/lov/lov_pack.c        - Pack/unpack of striping attributes&lt;br /&gt;
 lustre/lov/lov_qos.c         - Object allocation for different OST loading&lt;br /&gt;
 lustre/lov/lov_request.c     - Request handling/splitting/merging&lt;br /&gt;
 lustre/lov/lproc_lov.c       - /proc/fs/lustre/lov tunables/statistics&lt;br /&gt;
&lt;br /&gt;
==quota==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Quotas allow a system administrator to limit the maximum amount of disk space a user or group can consume. Quotas are set by root, and can be specified for individual users and/or groups. Quota limits can be set on both blocks and inodes.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Lustre quota enforcement differs from standard Linux quota support in several ways:&lt;br /&gt;
&lt;br /&gt;
* Lustre quotas are administered via the lfs command, whereas standard Linux quotas use the quotactl interface.&lt;br /&gt;
* As Lustre is a distributed filesystem, Lustre quotas are also distributed, in order to limit the impact on performance.&lt;br /&gt;
* Quotas are allocated and consumed in a quantized fashion. &lt;br /&gt;
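The quantized behaviour in the last point can be illustrated with a toy model (all names here are invented; the real code lives under lustre/quota): a slave acquires quota from the master in qunit-sized chunks rather than block by block:&lt;br /&gt;

```c
#define QUNIT 1024  /* blocks granted per acquire request */

/* Toy quota master (e.g. on the MDS): tracks the remaining limit. */
struct quota_master { long long limit;  /* blocks still grantable */ };

/* Toy quota slave (e.g. an OST): holds locally granted, unused quota. */
struct quota_slave  { long long local; };

/* Ask the master for one more qunit; returns blocks actually granted. */
static long long quota_acquire(struct quota_master *m, struct quota_slave *s)
{
    long long grant = m->limit < QUNIT ? m->limit : QUNIT;

    m->limit -= grant;
    s->local += grant;
    return grant;
}

/* Consume blocks on the slave, acquiring more qunits as needed.
 * Returns 0 on success, -1 if the user is over quota. */
static int quota_consume(struct quota_master *m, struct quota_slave *s,
                         long long blocks)
{
    while (s->local < blocks)
        if (quota_acquire(m, s) == 0)
            return -1;  /* master has nothing left to grant */
    s->local -= blocks;
    return 0;
}
```

Because quota moves in whole qunits, most writes are satisfied from the local grant without a round trip to the master.&lt;br /&gt;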
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Quota core:&lt;br /&gt;
&lt;br /&gt;
 lustre/quota/quota_adjust_qunit.c&lt;br /&gt;
 lustre/quota/quota_check.c&lt;br /&gt;
 lustre/quota/quotacheck_test.c&lt;br /&gt;
 lustre/quota/quota_context.c&lt;br /&gt;
 lustre/quota/quota_ctl.c&lt;br /&gt;
 lustre/quota/quota_interface.c&lt;br /&gt;
 lustre/quota/quota_internal.h&lt;br /&gt;
 lustre/quota/quota_master.c &lt;br /&gt;
&lt;br /&gt;
Interactions with the underlying ldiskfs filesystem:&lt;br /&gt;
&lt;br /&gt;
 lustre/lvfs/fsfilt_ext3.c&lt;br /&gt;
 lustre/lvfs/lustre_quota_fmt.c&lt;br /&gt;
 lustre/lvfs/lustre_quota_fmt_convert.c &lt;br /&gt;
&lt;br /&gt;
Hooks under:&lt;br /&gt;
&lt;br /&gt;
 lustre/mds&lt;br /&gt;
 lustre/obdfilter &lt;br /&gt;
&lt;br /&gt;
Regression tests:&lt;br /&gt;
&lt;br /&gt;
 lustre/tests/sanity-quota.sh&lt;br /&gt;
&lt;br /&gt;
==security-gss==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Secure ptlrpc (sptlrpc) is a framework inside the ptlrpc layer. It acts on both sides of each ptlrpc connection between two nodes, transforming every RPC message to turn the connection into a secure communication link. By using GSS, sptlrpc is able to support multiple authentication mechanisms, but currently only Kerberos 5 is supported.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Supported security flavors:&lt;br /&gt;
&lt;br /&gt;
* null: no authentication, no data transform, thus no performance overhead; compatible with 1.6;&lt;br /&gt;
* plain: no authentication, simple data transform, minimal performance overhead;&lt;br /&gt;
* krb5x: per-user client-server mutual authentication using Kerberos 5, with data signing or encryption; can have substantial CPU overhead. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ptlrpc/sec*.c&lt;br /&gt;
 lustre/ptlrpc/gss/&lt;br /&gt;
 lustre/utils/gss/&lt;br /&gt;
&lt;br /&gt;
==security-capa==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Capabilities are pieces of data generated by one service (the master service), passed to a client, and presented by the client to another service (the slave service) to authorize an action. This is independent of the R/W/X permission-based file operation authorization.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/llite/llite_capa.c&lt;br /&gt;
 lustre/mdt/mdt_capa.c&lt;br /&gt;
 lustre/obdfilter/filter_capa.c&lt;br /&gt;
 lustre/obdclass/capa.c&lt;br /&gt;
 lustre/include/lustre_capa.h&lt;br /&gt;
&lt;br /&gt;
==security-identity== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Lustre identity is a framework for Lustre file operation authorization. Generally, it can be divided into two parts:&lt;br /&gt;
&lt;br /&gt;
* User-identity parse / upcall / mapping.&lt;br /&gt;
* File operation permission maintenance and checking, including both traditional file-mode-based permissions and ACL-based permissions.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/llite/llite_rmtacl.c&lt;br /&gt;
 lustre/mdt/mdt_identity.c&lt;br /&gt;
 lustre/mdt/mdt_idmap.c&lt;br /&gt;
 lustre/mdt/mdt_lib.c&lt;br /&gt;
 lustre/obdclass/idmap.c&lt;br /&gt;
 lustre/utils/l_getidentity.c&lt;br /&gt;
 lustre/include/lustre_idmap.h &lt;br /&gt;
&lt;br /&gt;
 lustre/llite/xattr.c&lt;br /&gt;
 lustre/mdt/mdt_xattr.c&lt;br /&gt;
 lustre/cmm/cmm_object.c&lt;br /&gt;
 lustre/cmm/mdc_object.c&lt;br /&gt;
 lustre/mdd/mdd_permission.c&lt;br /&gt;
 lustre/mdd/mdd_object.c&lt;br /&gt;
 lustre/mdd/mdd_dir.c&lt;br /&gt;
 lustre/obdclass/acl.c&lt;br /&gt;
 lustre/include/lustre_eacl.h&lt;br /&gt;
&lt;br /&gt;
==OST== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The OST is a very thin layer of the data server. Its main responsibility is to translate RPCs into local obdfilter calls, i.e., RPC parsing.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/ost/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==ldiskfs== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
ldiskfs is a local disk filesystem built on top of ext3. It adds extent support, a multiblock allocator, multi-mount protection, and the iopen feature to ext3.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
There is no ldiskfs source code in the Lustre repositories (only patches). Instead, the ext3 code is copied from your build kernel, the patches are applied, and then the whole thing is renamed to ldiskfs. For details, see ldiskfs/.&lt;br /&gt;
&lt;br /&gt;
==fsfilt== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The fsfilt layer abstracts the backing filesystem specifics away from the obdfilter and MDS code in Lustre 1.4 and 1.6. This avoids linking the obdfilter and MDS directly against the filesystem module and, in theory, allows different backing filesystems, but in practice this was never implemented. In Lustre 1.8 and later, this code is replaced by the OSD layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There is a core fsfilt module which auto-loads support for the backing filesystem based on the type specified during configuration. This loads a filesystem-specific fsfilt_{fstype} module with a set of methods for that filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are a number of different kinds of methods:&lt;br /&gt;
&lt;br /&gt;
* Get/set filesystem label and UUID for identifying the backing filesystem&lt;br /&gt;
* Start, extend, commit compound filesystem transactions to allow multi-file updates to be atomic for recovery&lt;br /&gt;
* Set a journal callback for transaction disk commit (for Lustre recovery)&lt;br /&gt;
* Store attributes in the inode (possibly avoiding side-effects like truncation when setting the inode size to zero)&lt;br /&gt;
* Get/set file attributes (EAs) for storing LOV and OST info (e.g. striping)&lt;br /&gt;
* Perform low-level IO on the file (avoiding cache)&lt;br /&gt;
* Get/set file version (for future recovery mechanisms)&lt;br /&gt;
* Access quota information &lt;br /&gt;
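&lt;br /&gt;
The dispatch pattern described above can be sketched as follows. This is a minimal Python model of the idea (the real interface is a C method table in lustre/lvfs/fsfilt.c); all names and values here are illustrative, not the actual API:&lt;br /&gt;

```python
# Sketch of the fsfilt dispatch idea: a core module keeps a registry of
# per-filesystem method tables and looks one up by the fstype given at
# configuration time. In the kernel the lookup can also auto-load the
# fsfilt_{fstype} module.
_registry = {}

def register_fsfilt(fstype, ops):
    """A backend module registers its method table under its fstype."""
    _registry[fstype] = ops

def fsfilt_get_ops(fstype):
    """The core looks up the backend's method table by filesystem type."""
    if fstype not in _registry:
        raise KeyError("no fsfilt module for " + fstype)
    return _registry[fstype]

# A backend such as fsfilt_ext3 would provide entries like these
# (placeholder implementations):
register_fsfilt("ext3", {
    "get_label": lambda dev: "lustre-ost0001",   # read filesystem label
    "start": lambda inode, op: {"handle": 1},    # open a compound transaction
    "commit": lambda handle: None,               # commit the transaction
})

ops = fsfilt_get_ops("ext3")
print(ops["get_label"]("/dev/sda1"))
```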
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The files used for the fsfilt code reside in:&lt;br /&gt;
&lt;br /&gt;
 lustre/lvfs/fsfilt.c         - Interface used by obdfilter/MDS, module autoloading&lt;br /&gt;
 lustre/lvfs/fsfilt_ext3.c    - Interface to ext3/ldiskfs filesystem&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;fsfilt_ldiskfs.c&#039;&#039; file is auto-generated from &#039;&#039;fsfilt_ext3.c&#039;&#039; in &#039;&#039;lustre/lvfs/autoMakefile.am&#039;&#039; using sed to replace instances of ext3 and EXT3 with ldiskfs and LDISKFS, plus a few other replacements to avoid symbol clashes.&lt;br /&gt;
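&lt;br /&gt;
The renaming amounts to plain text substitution. A rough Python equivalent of the sed pass (the authoritative expressions live in lustre/lvfs/autoMakefile.am; the substitution list here is illustrative, not the exact one used):&lt;br /&gt;

```python
import re

def ext3_to_ldiskfs(source):
    """Rename ext3 symbols to ldiskfs ones, as the build's sed pass does.
    Only the two main substitutions are shown."""
    source = re.sub(r"EXT3", "LDISKFS", source)
    source = re.sub(r"ext3", "ldiskfs", source)
    return source

print(ext3_to_ldiskfs("ext3_journal_start calls EXT3_SB"))
# prints: ldiskfs_journal_start calls LDISKFS_SB
```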
&lt;br /&gt;
==ldiskfs OSD==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
ldiskfs-OSD is an implementation of the dt_{device,object} interfaces on top of the (modified) ldiskfs filesystem.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It uses standard ldiskfs/ext3 code to do file I/O.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
It supports two types of indices (in the same file system):&lt;br /&gt;
&lt;br /&gt;
* iam-based index: an extension of the ext3 htree directory format with support for more general keys and values and with relaxed size restrictions, and &lt;br /&gt;
* compatibility index: a usual ldiskfs directory, accessible through dt_index_operations. &lt;br /&gt;
&lt;br /&gt;
ldiskfs-OSD uses a read-write mutex to serialize compound operations. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/include/dt_object.h&lt;br /&gt;
 lustre/osd/osd_internal.h&lt;br /&gt;
 lustre/osd/osd_handler.c&lt;br /&gt;
&lt;br /&gt;
==DMU OSD== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This is another implementation of the OSD API, for the userspace DMU. It uses the DMU&#039;s ZAP for indices.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 dmu-osd/*.[ch] in b_hd_dmu branch&lt;br /&gt;
&lt;br /&gt;
==DMU== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The DMU is one of the layers in Sun&#039;s ZFS filesystem which is responsible for presenting a transactional object store to its consumers. It is used as Lustre&#039;s backend object storage mechanism for the userspace MDSs and OSSs.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ZFS community page has a source tour which is useful as an introduction to the several ZFS layers: [http://www.opensolaris.org/os/community/zfs/source/ ZFS source]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
There are many useful resources in that community page.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
For reference, here&#039;s a list of DMU features:&lt;br /&gt;
&lt;br /&gt;
* Atomic transactions&lt;br /&gt;
* End-to-end data and metadata checksumming (currently supports fletcher2, fletcher4 and sha-256)&lt;br /&gt;
* Compression (currently supports lzjb and gzip with compression levels 1..9)&lt;br /&gt;
* Snapshots and clones&lt;br /&gt;
* Variable block sizes (currently supports block sizes from 512 bytes to 128KB)&lt;br /&gt;
* Integrated volume management with support for RAID-1, RAID-Z and RAID-Z2 and striping&lt;br /&gt;
* Metadata and optional data redundancy (ditto blocks) atop the inherent storage pool redundancy for high resilience&lt;br /&gt;
* Self-healing, which works due to checksumming, ditto blocks and pool redundancy&lt;br /&gt;
* Storage devices that act as level-2 caches (designed for flash storage)&lt;br /&gt;
* Hot spares&lt;br /&gt;
* Designed with scalability in mind - supports up to 2^64 bytes per object, 2^48 objects per filesystem, 2^64 filesystems per pool, 2^64 bytes per device, 2^64 devices per pool, etc.&lt;br /&gt;
* An easy-to-use admin interface (the zfs and zpool commands) &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 src/                  - source code&lt;br /&gt;
 &lt;br /&gt;
 src/cmd/              - ZFS/DMU related programs&lt;br /&gt;
 src/cmd/lzfs/         - lzfs, the filesystem administration utility&lt;br /&gt;
 src/cmd/lzpool/       - lzpool, the pool administration utility&lt;br /&gt;
 src/cmd/lzdb/         - lzdb, the zfs debugger&lt;br /&gt;
 src/cmd/lztest/       - lztest, the DMU test suite&lt;br /&gt;
 src/cmd/lzfsd/        - lzfsd, the ZFS daemon&lt;br /&gt;
 &lt;br /&gt;
 src/lib/              - Libraries&lt;br /&gt;
 src/lib/port/         - Portability layer&lt;br /&gt;
 src/lib/solcompat/    - Solaris -&amp;gt; Linux portability layer (deprecated, use libport instead)&lt;br /&gt;
 src/lib/avl/          - AVL trees, used in many places in the DMU code&lt;br /&gt;
 src/lib/nvpair/       - Name-value pairs, used in many places in the DMU code&lt;br /&gt;
 src/lib/umem/         - Memory management library&lt;br /&gt;
 src/lib/zpool/        - Main ZFS/DMU code&lt;br /&gt;
 src/lib/zfs/          - ZFS library used by the lzfs and lzpool utilities&lt;br /&gt;
 src/lib/zfscommon/    - Common ZFS code between libzpool and libzfs&lt;br /&gt;
 src/lib/ctl/          - Userspace control/management interface&lt;br /&gt;
 src/lib/udmu/         - Lustre uDMU code (thin library around the DMU)&lt;br /&gt;
 &lt;br /&gt;
 src/scons/            - local copy of SCons&lt;br /&gt;
 &lt;br /&gt;
 tests/regression/     - Regression tests&lt;br /&gt;
 &lt;br /&gt;
 misc/                 - miscellaneous files/scripts&lt;br /&gt;
&lt;br /&gt;
==obdfilter==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
obdfilter is a core component of the OST (data server) that makes the underlying disk filesystem part of the distributed system:&lt;br /&gt;
&lt;br /&gt;
* Maintains cluster-wide coherency for data&lt;br /&gt;
* Maintains space reservation for data in client&#039;s cache (grants)&lt;br /&gt;
* Maintains quota &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/obdfilter/*.[ch]&lt;br /&gt;
&lt;br /&gt;
==MDS==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The MDS service in Lustre 1.4 and 1.6 is a monolithic body of code that provides multiple functions related to filesystem metadata. It handles the incoming RPCs and service threads for metadata operations (create, rename, unlink, readdir, etc.), interfaces with the Lustre lock manager ([https://wikis.clusterfs.com/intra/index.php/Ldlm_summary DLM]), and also manages the underlying filesystem (via the [https://wikis.clusterfs.com/intra/index.php/Fsfilt_summary fsfilt] interface).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The MDS is the primary point of access control for clients; it allocates the objects belonging to a file (in conjunction with [https://wikis.clusterfs.com/intra/index.php/Lov_summary LOV]) and passes that information to the clients when they access a file.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The MDS is also ultimately responsible for deleting objects on the OSTs, either by passing the destroy information to the client that removes the last link or open reference on a file and having that client do it, or by destroying the objects on the OSTs itself if the client fails to do so.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In the 1.8 and later releases, the functionality provided by the MDS code has been split into multiple parts ([https://wikis.clusterfs.com/intra/index.php/Mdt_summary MDT], [https://wikis.clusterfs.com/intra/index.php/Mdd_summary MDD], OSD) in order to allow stacking of the metadata devices for clustered metadata.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/mds/commit_confd.c&lt;br /&gt;
 lustre/mds/handler.c            - RPC request handler&lt;br /&gt;
 lustre/mds/lproc_mds.c          - /proc interface for stats/control&lt;br /&gt;
 lustre/mds/mds_fs.c             - Mount/configuration of underlying filesystem&lt;br /&gt;
 lustre/mds/mds_internal.h       - Header for internal declarations&lt;br /&gt;
 lustre/mds/mds_join.c           - Handle join_file operations&lt;br /&gt;
 lustre/mds/mds_lib.c            - Unpack of wire structs from requests&lt;br /&gt;
 lustre/mds/mds_log.c            - Lustre log interface (llog) for unlink/setattr&lt;br /&gt;
 lustre/mds/mds_lov.c            - Interface to LOV for create and orphan&lt;br /&gt;
 lustre/mds/mds_open.c           - File open/close handling&lt;br /&gt;
 lustre/mds/mds_reint.c          - Reintegration of changes made by clients&lt;br /&gt;
 lustre/mds/mds_unlink_open.c    - Handling of open-unlinked files (PENDING dir)&lt;br /&gt;
 lustre/mds/mds_xattr.c          - User-extended attribute handling&lt;br /&gt;
&lt;br /&gt;
==MDT==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MDT stands for MetaData Target. This is the top-most layer in the metadata server device stack. The MDT is responsible for everything network-related, as far as metadata is concerned:&lt;br /&gt;
&lt;br /&gt;
* Managing PTLRPC services and threads;&lt;br /&gt;
* Receiving incoming requests, unpacking them and checking their validity;&lt;br /&gt;
* Sending replies;&lt;br /&gt;
* Handling recovery;&lt;br /&gt;
* Using DLM to guarantee cluster-wide meta-data consistency;&lt;br /&gt;
* Handling intents;&lt;br /&gt;
* Handling credential translation. &lt;br /&gt;
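&lt;br /&gt;
The request-handling responsibilities above can be sketched as a simple pipeline. This is an illustrative Python model with made-up names, not the real mdt_handler.c entry points:&lt;br /&gt;

```python
import threading

# Sketch of the MDT request pipeline: receive a wire request, unpack
# and validate it, take a lock for cluster-wide metadata consistency,
# execute the operation, and build the reply.
def mdt_handle(raw_request, handlers, dlm_lock):
    req = dict(raw_request)              # "unpack" the wire request
    opcode = req.get("opcode")
    if opcode not in handlers:
        return {"status": "EPROTO"}      # validity check failed
    with dlm_lock:                       # stands in for a DLM lock
        result = handlers[opcode](req)
    return {"status": "OK", "result": result}   # the reply to send back

lock = threading.Lock()
handlers = {"getattr": lambda req: {"size": 0}}
print(mdt_handle({"opcode": "getattr", "fid": 17}, handlers, lock))
```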
&lt;br /&gt;
Theoretically, MDT is an optional layer: a completely local Lustre setup, with a single metadata server and a locally mounted client, can exist without MDT (and still use networking for non-metadata access). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/mdt/mdt.mod.c&lt;br /&gt;
 lustre/mdt/mdt_capa.c&lt;br /&gt;
 lustre/mdt/mdt_handler.c&lt;br /&gt;
 lustre/mdt/mdt_identity.c&lt;br /&gt;
 lustre/mdt/mdt_idmap.c&lt;br /&gt;
 lustre/mdt/mdt_internal.h&lt;br /&gt;
 lustre/mdt/mdt_lib.c&lt;br /&gt;
 lustre/mdt/mdt_lproc.c&lt;br /&gt;
 lustre/mdt/mdt_open.c&lt;br /&gt;
 lustre/mdt/mdt_recovery.c&lt;br /&gt;
 lustre/mdt/mdt_reint.c&lt;br /&gt;
 lustre/mdt/mdt_xattr.c&lt;br /&gt;
&lt;br /&gt;
==CMM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM is a new layer in the MDS which handles all clustered metadata issues and relationships. The CMM does the following:&lt;br /&gt;
&lt;br /&gt;
* Acts as layer between the MDT and MDD.&lt;br /&gt;
* Provides MDS-MDS interaction.&lt;br /&gt;
* Queries and updates FLD.&lt;br /&gt;
* Does the local or remote operation if needed.&lt;br /&gt;
* Will do rollback - epoch control, undo logging. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;CMM functionality&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
CMM chooses all servers involved in an operation and sends dependent requests if needed. Calling a remote MDS is a new feature related to CMD (clustered metadata). CMM maintains a list of MDCs in order to connect to all the other MDSs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Objects&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The CMM can allocate two types of objects - local and remote. Remote objects can occur during metadata operations involving more than one object; such an operation is called a cross-ref operation. &lt;br /&gt;
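&lt;br /&gt;
The local/remote decision can be sketched as follows: a minimal Python model of the FLD lookup, with made-up server names and a trivial sequence-to-server map standing in for the real FLD:&lt;br /&gt;

```python
# Sketch of the CMM decision: consult the FLD to find which MDS holds
# an object's FID sequence, then execute locally or forward the request
# through the MDC that connects to the remote MDS.
fld = {0: "mds0", 1: "mds1"}      # FID sequence to server (the FLD)

def cmm_execute(local_mds, fid_seq, op):
    target = fld[fid_seq]
    if target == local_mds:
        return ("local", op)            # run through the local MDD
    return ("remote", target, op)       # send via the MDC for that MDS

print(cmm_execute("mds0", 0, "create"))    # object is local
print(cmm_execute("mds0", 1, "link"))      # cross-ref: remote MDS involved
```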
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039; &lt;br /&gt;
 lustre/cmm&lt;br /&gt;
&lt;br /&gt;
==MDD== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
MDD is the metadata layer in the new MDS stack, and the only layer that operates on metadata in the MDS. The implementation is similar to VFS metadata operations but is based on OSD storage. The MDD API is currently used only in the new MDS stack, called by the CMM layer.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In theory, MDD should be a purely local metadata layer, but for compatibility with the old MDS stack, and to reuse some MDS code (llog and LOV), an mds device is created and connected to the mdd. The llog and LOV code in mdd therefore still uses the original implementation through this temporary mds device, which will be removed when the new llog and LOV layers in the new MDS stack are implemented. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/mdd/&lt;br /&gt;
&lt;br /&gt;
==recovery==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Overview&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Client recovery starts when no server reply is received within a given timeout, or when the server tells the client that it is not connected (the client was evicted from the server earlier for whatever reason).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Recovery consists of trying to connect to the server and then stepping through several recovery states, during which various client-server state is synchronized: namely, all requests that were already sent to the server but not yet confirmed as received, and DLM locks. Should any problem arise during the recovery process (be it a timeout or the server&#039;s refusal to recognise the client again), recovery is restarted from the very beginning.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
During recovery, new requests are not sent to the server but are instead added to a special delayed-requests queue, which is then sent once recovery completes successfully.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Replay and Resend&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* Clients go through all the requests in the sending and replay lists and determine the recovery action needed: replay the request, resend the request, or clean up the associated state for committed requests.&lt;br /&gt;
* The client replays requests that were not committed on the server, but for which the client saw a reply from the server before it failed. This allows the server to replay the changes to the persistent store.&lt;br /&gt;
* The client resends requests that were committed on the server, but for which the client did not see a reply, perhaps because a server or network failure caused the reply to be lost. This allows the server to reconstruct the reply and send it to the client.&lt;br /&gt;
* The client also resends requests that the server has not seen at all; these are the requests with a transid higher than the server&#039;s last_rcvd value and last_committed transno, and for which the reply-seen flag is not set.&lt;br /&gt;
* The client gets the last_committed transno information from the server and cleans up the state associated with requests that were committed on the server. &lt;br /&gt;
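&lt;br /&gt;
The per-request decision described in the list above can be sketched as follows (a simplification with illustrative names; the real logic spans the ptlrpc client code):&lt;br /&gt;

```python
# Sketch of the per-request recovery decision after reconnect.
def recovery_action(req_transno, reply_seen, last_committed):
    """req_transno: transno the server assigned to the request;
    reply_seen: whether the client saw a reply before the failure;
    last_committed: highest transno the server committed to disk."""
    if not reply_seen:
        return "resend"    # server reconstructs the reply or executes anew
    if req_transno > last_committed:
        return "replay"    # replied but not on disk: replay the change
    return "cleanup"       # committed and replied: drop the saved state

print(recovery_action(10, True, 5))    # replay
print(recovery_action(3, True, 5))     # cleanup
print(recovery_action(10, False, 5))   # resend
```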
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Recovery code is scattered through almost the entire code base. The most important locations are:&lt;br /&gt;
 ldlm/ldlm_lib.c    - generic server recovery code &lt;br /&gt;
 ptlrpc/            - client recovery code&lt;br /&gt;
&lt;br /&gt;
==version recovery== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Version Based Recovery&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
This recovery technique uses versions of objects (inodes) to allow clients to recover after the ordinary server recovery window has passed.&lt;br /&gt;
&lt;br /&gt;
# The server changes the version of an object on any change and returns that version to the client. The version can be checked during replay to be sure that the object is in the same state during replay as it was originally.&lt;br /&gt;
# After a failure, the server starts recovery as usual, but if some client is missing, version checks will be used for replays.&lt;br /&gt;
# A client that missed recovery can connect later and try to recover. This is &#039;delayed recovery&#039;, and version checks are always used during it.&lt;br /&gt;
# A client that missed the main recovery window will not be evicted and can connect later to initiate recovery. In that case, the versions are checked to determine whether each object was changed by someone else in the meantime.&lt;br /&gt;
# When replay is finished, the client and server check whether any replay failed because of a version mismatch. If not, the client gets a successful reintegration message; if a version mismatch was encountered, the client must be evicted.&lt;br /&gt;
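&lt;br /&gt;
The version check at the heart of the steps above can be sketched as follows (an illustrative Python model, not the actual data structures):&lt;br /&gt;

```python
# Sketch of version-based recovery: the server bumps an object's
# version on every change and returns it to the client; a delayed
# replay is valid only if the object's version still matches.
class VersionedObject:
    def __init__(self):
        self.version = 0

    def change(self):
        self.version += 1      # server bumps the version on each change
        return self.version    # returned to the client with the reply

def replay_allowed(obj, recorded_version):
    """A delayed replay is safe only if nobody changed the object since."""
    return obj.version == recorded_version

obj = VersionedObject()
v = obj.change()                 # client's change; the client records v
print(replay_allowed(obj, v))    # True: object untouched, replay is safe
obj.change()                     # someone else modified it meanwhile
print(replay_allowed(obj, v))    # False: version mismatch, client evicted
```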
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Recovery code is scattered through almost the entire code base. The most important locations are:&lt;br /&gt;
 ldlm/ldlm_lib.c    - generic server recovery code &lt;br /&gt;
 ptlrpc/            - client recovery code&lt;br /&gt;
&lt;br /&gt;
==IAM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
IAM stands for &#039;Index Access Module&#039;: it is an extension to the ldiskfs directory code, adding generic indexing capability.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A file system directory can be thought of as an index mapping keys, which are strings (file names), to records, which are integers (inode numbers). IAM removes the limitations on key and record size and format, providing the abstraction of a transactional container mapping arbitrary opaque keys to opaque records.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Implementation notes:&lt;br /&gt;
&lt;br /&gt;
* IAM is implemented as a set of patches to the ldiskfs;&lt;br /&gt;
* IAM is an extension of the ldiskfs directory code that uses the htree data structure for scalable indexing;&lt;br /&gt;
* IAM uses fine-grained key-level and node-level locking (pdirops locking, designed and implemented by Alex Tomas);&lt;br /&gt;
* IAM doesn&#039;t assume any internal format for keys. Keys are compared by the memcmp() function (which dictates big-endian order for scalars);&lt;br /&gt;
* IAM supports different flavors of containers:&lt;br /&gt;
** lfix: fixed size record and fixed size keys,&lt;br /&gt;
** lvar: variable sized records and keys,&lt;br /&gt;
** htree: compatibility mode, allowing normal htree directory to be accessed as an IAM container; &lt;br /&gt;
* IAM comes with an ioctl(2)-based user-level interface. &lt;br /&gt;
&lt;br /&gt;
IAM is used by ldiskfs-OSD to implement the dt_index_operations interface. &lt;br /&gt;
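&lt;br /&gt;
The memcmp() comparison mentioned above is why scalar keys must be stored big-endian: only then does bytewise order agree with numeric order. A small Python illustration:&lt;br /&gt;

```python
# Bytewise (memcmp-style) comparison of fixed-width keys matches the
# numeric order only for big-endian encodings.
def be_key(n, width=8):
    return n.to_bytes(width, "big")      # big-endian encoding

def le_key(n, width=8):
    return n.to_bytes(width, "little")   # little-endian encoding

nums = [1, 256, 2, 65536]
print(sorted(nums, key=be_key) == sorted(nums))   # True: orders agree
print(sorted(nums, key=le_key) == sorted(nums))   # False: LE order differs
```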
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6-sles10.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-ops.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-rhel5.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-rhel4.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-vanilla.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-separate.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.9-rhel4.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-sles10.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-common.patch&lt;br /&gt;
 lustre/ldiskfs/kernel_patches/patches/ext3-iam-uapi.patch&lt;br /&gt;
&lt;br /&gt;
==SOM== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Size-on-MDS is a metadata improvement that caches the inode size, blocks, ctime, and mtime on the MDS. This attribute caching allows clients to avoid making RPCs to the OSTs to fetch the attributes stored in the file objects kept on those OSTs, which significantly improves the performance of listing directories.&lt;br /&gt;
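&lt;br /&gt;
The RPC saving can be sketched as a back-of-the-envelope count; the numbers and the helper below are purely illustrative:&lt;br /&gt;

```python
# Sketch of the RPC cost of stat()ing every file in a directory:
# without valid SOM attributes, each stat needs one glimpse RPC per
# stripe object on the OSTs; with SOM, the MDS getattr reply already
# carries size/blocks/mtime.
def stat_rpcs(nfiles, stripe_count, som_valid):
    mds_rpcs = nfiles                  # one MDS getattr per file
    ost_rpcs = 0 if som_valid else nfiles * stripe_count
    return mds_rpcs + ost_rpcs

print(stat_rpcs(1000, 4, som_valid=False))   # 5000 RPCs for the listing
print(stat_rpcs(1000, 4, som_valid=True))    # 1000 RPCs
```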
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 llite/llite_close.c       - client-side SOM code&lt;br /&gt;
 liblustre/file.c          - liblustre SOM code&lt;br /&gt;
 mdt/mdt_handler.c         - general handling of SOM-related RPCs&lt;br /&gt;
 mdt/mdt_open.c            - MDS side SOM code &lt;br /&gt;
 mdt/mdt_recovery.c        - MDS side SOM recovery code&lt;br /&gt;
 obdfilter/filter_log.c    - OST side IO epoch llog code&lt;br /&gt;
&lt;br /&gt;
==tests== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;tests&amp;quot; subsystem is a set of scripts and programs used to test the other Lustre subsystems. It contains:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;runtests&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Simple basic regression test&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of regression tests that verify operation under normal operating conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;fsx&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
File system exerciser&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanityn&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Tests that verify operations from two clients under normal operating conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;lfsck&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Tests e2fsck and lfsck to detect and fix filesystem corruption&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;liblustre&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Runs a test linked to a liblustre client library&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-single&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify recovery after MDS failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;conf-sanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify the configuration&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;recovery-small&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify RPC replay after communications failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-ost-single&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify recovery after OST failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;replay-dual&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of unit tests that verify the recovery from two clients after server failure&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;insanity&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of tests that verify the multiple concurrent failure conditions&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;sanity-quota&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
A set of tests that verify filesystem quotas&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The acceptance-small.sh script is a wrapper that is normally used to run all (or any) of these scripts. In addition, it is used to run the following pre-installed benchmarks:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;dbench&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Dbench benchmark, which simulates N clients to produce filesystem load&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;bonnie&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Bonnie++ benchmark for creating, reading, and deleting many small files&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;iozone&#039;&#039;&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Iozone benchmark for generating and measuring a variety of file operations.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
 lustre/tests/acl/run&lt;br /&gt;
 lustre/tests/acl/make-tree&lt;br /&gt;
 lustre/tests/acl/README&lt;br /&gt;
 lustre/tests/acl/setfacl.test&lt;br /&gt;
 lustre/tests/acl/getfacl-noacl.test&lt;br /&gt;
 lustre/tests/acl/permissions.test&lt;br /&gt;
 lustre/tests/acl/inheritance.test&lt;br /&gt;
 lustre/tests/acl/misc.test&lt;br /&gt;
 lustre/tests/acl/cp.test&lt;br /&gt;
 lustre/tests/cfg/local.sh&lt;br /&gt;
 lustre/tests/cfg/insanity-local.sh&lt;br /&gt;
 lustre/tests/ll_sparseness_write.c&lt;br /&gt;
 lustre/tests/writeme.c&lt;br /&gt;
 lustre/tests/cobd.sh&lt;br /&gt;
 lustre/tests/test_brw.c&lt;br /&gt;
 lustre/tests/ll_getstripe_info.c&lt;br /&gt;
 lustre/tests/lov-sanity.sh&lt;br /&gt;
 lustre/tests/sleeptest.c&lt;br /&gt;
 lustre/tests/flocks_test.c&lt;br /&gt;
 lustre/tests/getdents.c&lt;br /&gt;
 lustre/tests/ll_dirstripe_verify.c&lt;br /&gt;
 lustre/tests/sanity.sh&lt;br /&gt;
 lustre/tests/multifstat.c&lt;br /&gt;
 lustre/tests/sanityN.sh&lt;br /&gt;
 lustre/tests/liblustre_sanity_uml.sh&lt;br /&gt;
 lustre/tests/fsx.c&lt;br /&gt;
 lustre/tests/small_write.c&lt;br /&gt;
 lustre/tests/socketserver&lt;br /&gt;
 lustre/tests/cmknod.c&lt;br /&gt;
 lustre/tests/README&lt;br /&gt;
 lustre/tests/acceptance-metadata-double.sh&lt;br /&gt;
 lustre/tests/writemany.c&lt;br /&gt;
 lustre/tests/llecho.sh&lt;br /&gt;
 lustre/tests/lfscktest.sh&lt;br /&gt;
 lustre/tests/run-llog.sh&lt;br /&gt;
 lustre/tests/conf-sanity.sh&lt;br /&gt;
 lustre/tests/mmap_sanity.c&lt;br /&gt;
 lustre/tests/write_disjoint.c&lt;br /&gt;
 lustre/tests/ldaptest.c&lt;br /&gt;
 lustre/tests/acceptance-metadata-single.sh&lt;br /&gt;
 lustre/tests/compile.sh&lt;br /&gt;
 lustre/tests/mcreate.c&lt;br /&gt;
 lustre/tests/runas.c&lt;br /&gt;
 lustre/tests/replay-single.sh&lt;br /&gt;
 lustre/tests/lockorder.sh&lt;br /&gt;
 lustre/tests/test2.c&lt;br /&gt;
 lustre/tests/llog-test.sh&lt;br /&gt;
 lustre/tests/fchdir_test.c&lt;br /&gt;
 lustre/tests/mkdirdeep.c&lt;br /&gt;
 lustre/tests/runtests&lt;br /&gt;
 lustre/tests/flock.c&lt;br /&gt;
 lustre/tests/mlink.c&lt;br /&gt;
 lustre/tests/checkstat.c&lt;br /&gt;
 lustre/tests/crash-mod.sh&lt;br /&gt;
 lustre/tests/multiop.c&lt;br /&gt;
 lustre/tests/random-reads.c&lt;br /&gt;
 lustre/tests/disk1_4.zip&lt;br /&gt;
 lustre/tests/rundbench&lt;br /&gt;
 lustre/tests/wantedi.c&lt;br /&gt;
 lustre/tests/rename_many.c&lt;br /&gt;
 lustre/tests/leak_finder.pl&lt;br /&gt;
 lustre/tests/Makefile.am&lt;br /&gt;
 lustre/tests/parallel_grouplock.c&lt;br /&gt;
 lustre/tests/chownmany.c&lt;br /&gt;
 lustre/tests/ost_oos.sh&lt;br /&gt;
 lustre/tests/mkdirmany.c&lt;br /&gt;
 lustre/tests/directio.c&lt;br /&gt;
 lustre/tests/insanity.sh&lt;br /&gt;
 lustre/tests/createmany-mpi.c&lt;br /&gt;
 lustre/tests/createmany.c&lt;br /&gt;
 lustre/tests/runiozone&lt;br /&gt;
 lustre/tests/rmdirmany.c&lt;br /&gt;
 lustre/tests/replay-ost-single.sh&lt;br /&gt;
 lustre/tests/mcr.sh&lt;br /&gt;
 lustre/tests/mrename.c&lt;br /&gt;
 lustre/tests/sanity-quota.sh&lt;br /&gt;
 lustre/tests/lp_utils.c&lt;br /&gt;
 lustre/tests/lp_utils.h&lt;br /&gt;
 lustre/tests/acceptance-metadata-parallel.sh&lt;br /&gt;
 lustre/tests/oos.sh&lt;br /&gt;
 lustre/tests/createdestroy.c&lt;br /&gt;
 lustre/tests/toexcl.c&lt;br /&gt;
 lustre/tests/replay-dual.sh&lt;br /&gt;
 lustre/tests/createtest.c&lt;br /&gt;
 lustre/tests/munlink.c&lt;br /&gt;
 lustre/tests/iopentest1.c&lt;br /&gt;
 lustre/tests/iopentest2.c&lt;br /&gt;
 lustre/tests/openme.c&lt;br /&gt;
 lustre/tests/openclose.c&lt;br /&gt;
 lustre/tests/test-framework.sh&lt;br /&gt;
 lustre/tests/ll_sparseness_verify.c&lt;br /&gt;
 lustre/tests/it_test.c&lt;br /&gt;
 lustre/tests/unlinkmany.c&lt;br /&gt;
 lustre/tests/opendirunlink.c&lt;br /&gt;
 lustre/tests/filter_survey.sh&lt;br /&gt;
 lustre/tests/utime.c&lt;br /&gt;
 lustre/tests/openunlink.c&lt;br /&gt;
 lustre/tests/runvmstat&lt;br /&gt;
 lustre/tests/statmany.c&lt;br /&gt;
 lustre/tests/create.pl&lt;br /&gt;
 lustre/tests/oos2.sh&lt;br /&gt;
 lustre/tests/statone.c&lt;br /&gt;
 lustre/tests/rename.pl&lt;br /&gt;
 lustre/tests/set_dates.sh&lt;br /&gt;
 lustre/tests/openfilleddirunlink.c&lt;br /&gt;
 lustre/tests/openfile.c&lt;br /&gt;
 lustre/tests/llmountcleanup.sh&lt;br /&gt;
 lustre/tests/llmount.sh&lt;br /&gt;
 lustre/tests/acceptance-small.sh&lt;br /&gt;
 lustre/tests/truncate.c&lt;br /&gt;
 lustre/tests/recovery-small.sh&lt;br /&gt;
 lustre/tests/2ost.sh&lt;br /&gt;
 lustre/tests/tchmod.c&lt;br /&gt;
 lustre/tests/socketclient&lt;br /&gt;
 lustre/tests/runobdstat&lt;br /&gt;
 lustre/tests/memhog.c&lt;br /&gt;
 lustre/tests/flock_test.c&lt;br /&gt;
 lustre/tests/busy.sh&lt;br /&gt;
 lustre/tests/write_append_truncate.c&lt;br /&gt;
 lustre/tests/opendevunlink.c&lt;br /&gt;
 lustre/tests/o_directory.c&lt;br /&gt;
&lt;br /&gt;
==build== &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Summary&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The build system is responsible for building Lustre and related components (ldiskfs is normally included in the Lustre tree but can also live completely separately).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The main build process is managed using GNU Autoconf and Automake. Here is a brief outline of how a Lustre binary build from a fresh Git checkout works. User commands are shown in &#039;&#039;&#039;bold&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;sh autogen.sh&#039;&#039;&#039; - autogen performs a few checks and bootstraps the build system using automake and autoconf. It should only need to be called once after a fresh Git checkout, but sometimes it needs to be run again. See [https://bugzilla.lustre.org/show_bug.cgi?id=12580 bug 12580].&lt;br /&gt;
**Each component (Lustre and ldiskfs) has an autoMakefile.am in its toplevel directory that sets some variables and includes build/autoMakefile.am.toplevel. It also contains any toplevel automake code unique to that component.&lt;br /&gt;
** configure.ac is used by autoconf to generate a configure script. The Lustre configure.ac mostly relies on the macros defined in */autoconf/*.m4 to do its work. The ldiskfs configure.ac is more self-contained and relies only on build/autoconf/*.m4. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;./configure --with-linux=/root/cfs/kernels/linux-2.6.9-55.EL.HEAD&#039;&#039;&#039; - Configure performs extensive checks of the underlying system and kernel, then produces autoMakefiles and Makefiles. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;make&#039;&#039;&#039; - This is where things get really interesting.&lt;br /&gt;
** The @INCLUDE_RULES@ directive in most Makefile.in files includes a whole set of build rules from build/Makefile. See the top of that file for a description of all cases.&lt;br /&gt;
** Normally, it will include autoMakefile, so commands from that file will run.&lt;br /&gt;
** build/autoMakefile.am.toplevel is the basis of the autoMakefile produced in the toplevel directory. It includes the &amp;quot;modules&amp;quot; target.&lt;br /&gt;
** The modules target in turn calls the appropriate Linux make system if we are building on Linux.&lt;br /&gt;
** This build system once again reads the Makefile in each directory, and case 2 from build/Makefile is followed. &lt;br /&gt;
&lt;br /&gt;
So essentially, the Makefile.in controls the kernel build process, and the autoMakefile.am controls the userland build process as well as preparing the sources if necessary.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The build system can also be used to produce Lustre-patched kernels and binaries built against those kernels. The &#039;&#039;&#039;build/lbuild&#039;&#039;&#039; script does this; it is used by customers as well as the LTS. This script is in need of some serious cleanup, unfortunately.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Coding style note: as mentioned in [http://wiki.lustre.org/index.php?title=Coding_Guidelines Coding Guidelines], autoconf macros must follow the style specified in the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html GNU Autoconf manual]. A lot of the older code has inconsistent style and is hard to follow - feel free to reformat when needed. New code &#039;&#039;&#039;must&#039;&#039;&#039; be styled correctly. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Code&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Lustre build system:&lt;br /&gt;
&lt;br /&gt;
* build/* (shared with ldiskfs)&lt;br /&gt;
* autogen.sh&lt;br /&gt;
* autoMakefile.am&lt;br /&gt;
* configure.ac&lt;br /&gt;
* lustre.spec.in&lt;br /&gt;
* Makefile.in&lt;br /&gt;
* all autoMakefile.am files&lt;br /&gt;
* all Makefile.in files &lt;br /&gt;
&lt;br /&gt;
ldiskfs build system:&lt;br /&gt;
&lt;br /&gt;
* build/* (shared with Lustre)&lt;br /&gt;
* autogen.sh&lt;br /&gt;
* autoMakefile.am&lt;br /&gt;
* configure.ac&lt;br /&gt;
* lustre-ldiskfs.spec.in&lt;br /&gt;
* Makefile.in&lt;br /&gt;
* all autoMakefile.am files&lt;br /&gt;
* all Makefile.in files&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Acceptance_Small_(acc-sm)_Testing_on_Lustre&amp;diff=9064</id>
		<title>Acceptance Small (acc-sm) Testing on Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Acceptance_Small_(acc-sm)_Testing_on_Lustre&amp;diff=9064"/>
		<updated>2009-12-15T22:32:55Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* How do you get the acc-sm tests? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
The Lustre™ QE group and developers use acceptance-small (acc-sm) tests to catch bugs early in the development cycle. Within the Lustre group, acc-sm tests are run on YALA, an automated test system. This information is being published to describe the steps to perform acceptance small testing and encourage wider acc-sm testing in the Lustre community.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: For your convenience, this document is also available as a [http://wiki.lustre.org/images/c/c6/AccSm_Testing.pdf PDF].&lt;br /&gt;
&lt;br /&gt;
==What is acc-sm testing and why do we use it for Lustre?==&lt;br /&gt;
&lt;br /&gt;
Acceptance small (acc-sm) testing is a suite of test cases used to verify different aspects of Lustre functionality.&lt;br /&gt;
&lt;br /&gt;
* These tests are run using the acceptance-small.sh script. &lt;br /&gt;
* The script is run from the lustre/tests directory in a compiled Lustre tree. &lt;br /&gt;
* The acceptance-small.sh script runs a number of test scripts that are also run by the ltest (Buffalo) test harness on Lustre test clusters.&lt;br /&gt;
&lt;br /&gt;
==What tests comprise the acc-sm test suite?==&lt;br /&gt;
&lt;br /&gt;
Each Lustre tree contains a lustre/tests sub-directory; all acc-sm tests are stored here. The acceptance-small.sh file contains a list of all tests in the acc-sm suite. To get the list, run:&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
The acc-sm tests are listed below, by branch.&lt;br /&gt;
&lt;br /&gt;
====b1_6 branch====&lt;br /&gt;
&lt;br /&gt;
This branch includes 17 acc-sm test suites.&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
 export TESTSUITE_LIST=&amp;quot;RUNTESTS SANITY DBENCH BONNIE IOZONE FSX SANITYN LFSCK&lt;br /&gt;
 LIBLUSTRE REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE&lt;br /&gt;
 REPLAY_DUAL INSANITY SANITY_QUOTA PERFORMANCE_SANITY&amp;quot;&lt;br /&gt;
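The suite count stated above can be checked by word-counting the list; this sketch simply inlines the b1_6 TESTSUITE_LIST shown above:

```shell
# Count the suites in the b1_6 TESTSUITE_LIST quoted above (list copied verbatim)
TESTSUITE_LIST="RUNTESTS SANITY DBENCH BONNIE IOZONE FSX SANITYN LFSCK \
LIBLUSTRE REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE \
REPLAY_DUAL INSANITY SANITY_QUOTA PERFORMANCE_SANITY"
echo "$TESTSUITE_LIST" | wc -w    # prints 17, matching the count above
```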
&lt;br /&gt;
====b1_8_gate branch====&lt;br /&gt;
&lt;br /&gt;
This branch includes 18 acc-sm test suites.&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
 export TESTSUITE_LIST=&amp;quot;RUNTESTS SANITY DBENCH BONNIE IOZONE FSX SANITYN LFSCK&lt;br /&gt;
 LIBLUSTRE REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE&lt;br /&gt;
 REPLAY_DUAL REPLAY_VBR INSANITY SANITY_QUOTA PERFORMANCE_SANITY&amp;quot;&lt;br /&gt;
&lt;br /&gt;
====HEAD branch====&lt;br /&gt;
&lt;br /&gt;
This branch includes 19 acc-sm test suites.&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
 export TESTSUITE_LIST=&amp;quot;RUNTESTS SANITY DBENCH BONNIE IOZONE FSX SANITYN LFSCK&lt;br /&gt;
 LIBLUSTRE REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE&lt;br /&gt;
 REPLAY_DUAL INSANITY SANITY_QUOTA SANITY_SEC SANITY_GSS&lt;br /&gt;
 PERFORMANCE_SANITY&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To see the test cases in a particular acc-sm test, run:&lt;br /&gt;
&lt;br /&gt;
 $ grep run_ &amp;lt;test suite script&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, to see the last 3 test cases that comprise the SANITY test:&lt;br /&gt;
&lt;br /&gt;
 $ grep run_ sanity.sh | tail -3&lt;br /&gt;
&lt;br /&gt;
 run_test 130c &amp;quot;FIEMAP (2-stripe file with hole)&amp;quot;&lt;br /&gt;
 run_test 130d &amp;quot;FIEMAP (N-stripe file)&amp;quot;&lt;br /&gt;
 run_test 130e &amp;quot;FIEMAP (test continuation FIEMAP calls)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==What does each acc-sm test measure or show?==&lt;br /&gt;
&lt;br /&gt;
The acc-sm test suites are described below.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;RUNTESTS&#039;&#039;&#039;&lt;br /&gt;
: A basic regression test with unmount/remount.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;SANITY&#039;&#039;&#039;&lt;br /&gt;
: Verifies Lustre operation under normal operating conditions.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;DBENCH&#039;&#039;&#039;&lt;br /&gt;
:Dbench benchmark that simulates N clients to produce filesystem load.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;BONNIE&#039;&#039;&#039;&lt;br /&gt;
:Bonnie++ benchmark for creating, reading, and deleting many small files.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;IOZONE&#039;&#039;&#039;&lt;br /&gt;
:IOzone benchmark for generating and measuring a variety of file operations.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;FSX&#039;&#039;&#039;&lt;br /&gt;
:Filesystem exerciser.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;SANITYN&#039;&#039;&#039;&lt;br /&gt;
:Verifies operations from two clients under normal operating conditions.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;LFSCK&#039;&#039;&#039;&lt;br /&gt;
:Tests e2fsck and lfsck to detect and fix filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;LIBLUSTRE&#039;&#039;&#039;&lt;br /&gt;
:Runs a test linked to a liblustre client library.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;REPLAY_SINGLE&#039;&#039;&#039;&lt;br /&gt;
:Verifies recovery after an MDS failure.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;CONF_SANITY&#039;&#039;&#039;&lt;br /&gt;
:Verifies various Lustre configurations (including wrong ones), where the system must behave correctly.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;RECOVERY_SMALL&#039;&#039;&#039;&lt;br /&gt;
:Verifies RPC replay after a communications failure (message loss).&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;REPLAY_OST_SINGLE&#039;&#039;&#039;&lt;br /&gt;
:Verifies recovery after an OST failure.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;REPLAY_DUAL&#039;&#039;&#039;&lt;br /&gt;
:Verifies recovery from two clients after a server failure.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;INSANITY&#039;&#039;&#039;&lt;br /&gt;
:Tests multiple concurrent failure conditions.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;SANITY_QUOTA&#039;&#039;&#039;&lt;br /&gt;
:Verifies filesystem quotas.&lt;br /&gt;
&lt;br /&gt;
==How do you get the acc-sm tests?==&lt;br /&gt;
&lt;br /&gt;
The acc-sm test suite is stored in the lustre/tests subdirectory.&lt;br /&gt;
&lt;br /&gt;
==Do you have to run every acc-sm test?==&lt;br /&gt;
&lt;br /&gt;
No. You can choose to run only specified acc-sm tests. Tests can be run either with or without the acceptance-small.sh (acc-sm) wrapper script. Here are several examples:&lt;br /&gt;
&lt;br /&gt;
To only run the RUNTESTS and SANITY tests:&lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=&amp;quot;RUNTESTS SANITY&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
To only run test_1 and test_2 of the SANITYN tests:&lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=&amp;quot;SANITYN&amp;quot; ONLY=&amp;quot;1 2&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
To only run the replay-single.sh test and exclude (not run) the test_3* and test_4* tests: &lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=&amp;quot;REPLAY_SINGLE&amp;quot; REPLAY_SINGLE_EXCEPT=&amp;quot;3 4&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
To only run conf-sanity.sh tests after #15 (without the acceptance-small.sh wrapper script):&lt;br /&gt;
&lt;br /&gt;
 CONF_SANITY_EXCEPT=&amp;quot;$(seq 15)&amp;quot; sh conf-sanity.sh&lt;br /&gt;
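A note on the $(seq 15) trick: seq prints the numbers 1 through 15, so the EXCEPT list expands to every subtest up to 15, which is why only the later tests run. A minimal illustration:

```shell
# $(seq 15) produces "1 2 3 ... 15" (newline-separated; the shell word-splits it)
EXCEPT="$(seq 15)"
echo $EXCEPT    # prints: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
```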
&lt;br /&gt;
==Do the acc-sm tests have to be run in a specific order?==&lt;br /&gt;
&lt;br /&gt;
The test order is defined in the acceptance-small.sh script and in each test script. Users do not have to (and should not) do anything to change the order of tests.&lt;br /&gt;
&lt;br /&gt;
==Who runs the acc-sm tests?==&lt;br /&gt;
&lt;br /&gt;
Currently, the QE group and Lustre developers run acc-sm as the main test suite for Lustre testing. Acc-sm tests are run on YALA, the automated test system, with test reports submitted to Buffalo (a web interface that allows for browsing various Lustre test results). We welcome external contributions to the Lustre acc-sm test efforts – either of the Lustre code base or new testing platforms.&lt;br /&gt;
&lt;br /&gt;
==What type of Lustre environment is needed to run the acc-sm tests? Is anything special needed?==&lt;br /&gt;
&lt;br /&gt;
The default Lustre configuration for acc-sm testing is a single node setup with one MDS and two OSTs. All devices are loop-back devices. YALA, the automated test system, uses a non-default configuration.&lt;br /&gt;
&lt;br /&gt;
To run the acc-sm test suite on a non-default Lustre configuration, you have to modify the default settings in the acc-sm configuration file, lustre/tests/cfg/local.sh. The configuration variables include mds_HOST, ost_HOST, OSTCOUNT, MDS_MOUNT_OPTS and OST_MOUNT_OPTS, among others.&lt;br /&gt;
&lt;br /&gt;
To create your own configuration file, copy cfg/local.sh to cfg/my_config.sh:&lt;br /&gt;
&lt;br /&gt;
 cp cfg/local.sh cfg/my_config.sh&lt;br /&gt;
&lt;br /&gt;
Edit the necessary variables in the configuration file (my_config.sh) and run acc-sm as: NAME=my_config sh acceptance-small.sh&lt;br /&gt;
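The edits typically amount to overriding a handful of variables. This is a hypothetical cfg/my_config.sh fragment; the variable names are those listed above, while the host names and values are examples only:

```shell
# Hypothetical overrides in cfg/my_config.sh -- host names and values are examples
mds_HOST=mds1
ost_HOST=oss1
OSTCOUNT=2
MDS_MOUNT_OPTS="-o user_xattr"
OST_MOUNT_OPTS="-o user_xattr"
```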
&lt;br /&gt;
==What are the steps to run acc-sm?==&lt;br /&gt;
&lt;br /&gt;
There are two methods to run the acc-sm tests.&lt;br /&gt;
&lt;br /&gt;
1. Check out a Lustre branch (b1_6, b1_8 or HEAD).&lt;br /&gt;
&lt;br /&gt;
2. Change directory to lustre/tests:&lt;br /&gt;
&lt;br /&gt;
 cd lustre/tests&lt;br /&gt;
&lt;br /&gt;
3. Build lustre/tests.&lt;br /&gt;
&lt;br /&gt;
4. Run acc-sm on a local, default Lustre configuration (1 MGS/MDT, 1 OST and 1 client):&lt;br /&gt;
&lt;br /&gt;
 sh acceptance-small.sh 2&amp;gt;&amp;amp;1 | tee /tmp/output&lt;br /&gt;
&lt;br /&gt;
- OR -&lt;br /&gt;
&lt;br /&gt;
1. Install the lustre-tests RPM (available at lts-head:/var/cache/cfs/PACKAGE/rpm/lustre).&lt;br /&gt;
&lt;br /&gt;
2. Change directory to lustre/tests:&lt;br /&gt;
&lt;br /&gt;
 cd /usr/lib/lustre/tests&lt;br /&gt;
&lt;br /&gt;
3. Create your own configuration file and edit it for your configuration.&lt;br /&gt;
&lt;br /&gt;
 cp cfg/local.sh cfg/my_config.sh&lt;br /&gt;
&lt;br /&gt;
4. Run acc-sm on a local Lustre configuration.&lt;br /&gt;
&lt;br /&gt;
Here is an example of running acc-sm on a non-default Lustre configuration (MDS is sfire7, OST is sfire8, OSTCOUNT=1, etc.). In this example, only the SANITY test cases are run.&lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=SANITY mds_HOST=sfire7 ost8_HOST=sfire8 MDSDEV1=/dev/sda1&lt;br /&gt;
 OSTCOUNT=1 OSTDEV1=/dev/sda1 MDSSIZE=5000000 OSTSIZE=5000000&lt;br /&gt;
 MDS_MOUNT_OPTS=&amp;quot;-o user_xattr&amp;quot; OST_MOUNT_OPTS=&amp;quot; -o user_xattr&amp;quot;&lt;br /&gt;
 REFORMAT=&amp;quot;--reformat&amp;quot; PDSH=&amp;quot;pdsh -S -w&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What if I hit a failure on an acc-sm test?==&lt;br /&gt;
&lt;br /&gt;
* If you regularly hit a failure in any of these tests, check if a bug has been reported on the failure or file a new bug if one has not yet been opened.&lt;br /&gt;
* If the bug prevents you from completing the tests, set the environment variables to skip the specific test(s) until you or someone else fixes them.&lt;br /&gt;
:* For example, to skip sanity.sh subtest 36g and 65, replay-single.sh subtest 42, and all of insanity.sh set in your environment:&lt;br /&gt;
 &lt;br /&gt;
:&amp;lt;pre&amp;gt;&lt;br /&gt;
export SANITY_EXCEPT=&amp;quot;36g 65&amp;quot;&lt;br /&gt;
export REPLAY_SINGLE_EXCEPT=42&lt;br /&gt;
export INSANITY=no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:* You can also skip tests on the command line. For example, when running acceptance-small:&lt;br /&gt;
 &lt;br /&gt;
:&amp;lt;pre&amp;gt;&lt;br /&gt;
SANITY_EXCEPT=&amp;quot;36g 65&amp;quot; REPLAY_SINGLE_EXCEPT=42 INSANITY=no ./acceptance-small.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:* The test framework is very flexible, and it is an easy &amp;quot;hands-off&amp;quot; way of running tests while you are doing other things, like coding.&lt;br /&gt;
:* Questions/problems with the test framework should be emailed to the [http://wiki.lustre.org/index.php/Mailing_Lists lustre-discuss mailing list], so all Lustre users can benefit from improving and documenting it.&lt;br /&gt;
* If you do not run the entire test suite regularly, you will not know whether a bug was introduced by your code, and you will waste a lot of time looking for it.&lt;br /&gt;
&lt;br /&gt;
==How do you run acc-sm on a mounted Lustre system?==&lt;br /&gt;
&lt;br /&gt;
To run acc-sm on a Lustre system that is already mounted, you need to use the correct configuration file (according to the mounted Lustre system) and run acc-sm as: &lt;br /&gt;
&lt;br /&gt;
 SETUP=: CLEANUP=: FORMAT=: NAME=&amp;lt;config&amp;gt; sh acceptance-small.sh&lt;br /&gt;
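The `:` here is the POSIX shell no-op builtin: assigning it to SETUP, CLEANUP and FORMAT replaces those steps with a command that succeeds without doing anything, so the script runs against the already-mounted filesystem. A minimal illustration:

```shell
# ':' is the POSIX no-op builtin; it always succeeds and does nothing
SETUP=:
$SETUP && echo "setup step skipped; tests proceed"
```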
&lt;br /&gt;
==How do you run acc-sm with and without reformat?==&lt;br /&gt;
&lt;br /&gt;
By default, the acc-sm test suite does not reformat Lustre. If this is a new system or if you are using new devices and want to reformat Lustre, run acc-sm with REFORMAT=&amp;quot;--reformat&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
 REFORMAT=&amp;quot;--reformat&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
If needed, you can specify WRITECONF=&amp;quot;writeconf&amp;quot;, and then run acc-sm with WRITECONF=&amp;quot;writeconf&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
 WRITECONF=&amp;quot;writeconf&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==How do you run acc-sm in a Lustre configuration with several clients?==&lt;br /&gt;
&lt;br /&gt;
The default configuration file for acc-sm is cfg/local.sh, which uses only one client (local). To use additional remote clients, specify the RCLIENTS list and use the cfg/ncli.sh configuration file (or your own copy of ncli configuration).&lt;br /&gt;
&lt;br /&gt;
 NAME=ncli RCLIENTS=&amp;lt;space-separated list of remote clients&amp;gt; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
For example:&lt;br /&gt;
&lt;br /&gt;
 NAME=ncli RCLIENTS=&amp;quot;client2 client3 client11&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What is the SLOW variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The SLOW variable is used to run a subset of acc-sm tests. By default, the variable is set to SLOW=no, which causes some of the longer acc-sm tests to be skipped so that the acc-sm run completes in less than 2 hours. To run all of the acc-sm tests, set the variable to SLOW=yes:&lt;br /&gt;
&lt;br /&gt;
 SLOW=yes sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What is the FAIL_ON_ERROR variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The FAIL_ON_ERROR variable is used to &amp;quot;stop&amp;quot; or &amp;quot;continue&amp;quot; running acc-sm tests after a test failure occurs. If the variable is set to &amp;quot;true&amp;quot; (FAIL_ON_ERROR=true), then acc-sm stops after test_N fails and test_N+1 does not run. If the variable is set to &amp;quot;false&amp;quot; (FAIL_ON_ERROR=false), then acc-sm continues after test_N fails and test_N+1 does run.&lt;br /&gt;
&lt;br /&gt;
By default, FAIL_ON_ERROR=false for the sanity, sanityn and sanity-quota tests, and FAIL_ON_ERROR=true for the replay/recovery tests.&lt;br /&gt;
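The stop-versus-continue behavior can be sketched with a toy runner (an illustration only, not the real test-framework code):

```shell
# Toy illustration of FAIL_ON_ERROR semantics (not the actual test-framework code)
FAIL_ON_ERROR=${FAIL_ON_ERROR:-false}
run_one() {
    if ! "$@"; then
        echo "test failed: $*"
        # stop immediately when FAIL_ON_ERROR=true, otherwise keep going
        [ "$FAIL_ON_ERROR" = "true" ] && exit 1
    fi
}
run_one false                        # a failing test
run_one echo "test_N+1 still runs"   # reached only because FAIL_ON_ERROR=false
```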
&lt;br /&gt;
==What is the PDSH variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The PDSH variable is used to provide remote shell access. If acc-sm is run on a Lustre configuration with remote servers, specify PDSH like this:&lt;br /&gt;
&lt;br /&gt;
 PDSH=&amp;quot;pdsh -S -w&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
If the client has no access to the servers, you can run acc-sm without PDSH, but the tests which need PDSH access are skipped. A summary report is generated which lists the skipped tests.&lt;br /&gt;
&lt;br /&gt;
==What is the LOAD_MODULES_REMOTE variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The LOAD_MODULES_REMOTE variable is used to load/unload modules on remote nodes. By default, the variable is set to LOAD_MODULES_REMOTE=false, and modules are not loaded or unloaded on remote nodes during acceptance small testing. &lt;br /&gt;
&lt;br /&gt;
To load/unload modules on remote nodes, set the variable to LOAD_MODULES_REMOTE=true when running the acc-sm tests:&lt;br /&gt;
&lt;br /&gt;
 LOAD_MODULES_REMOTE=true sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What is the EXCEPT_LIST_FILE variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
In Lustre 1.8.2 and later, the EXCEPT_LIST_FILE variable can be used to specify the tests-to-skip file, which tracks the tests to skip during acc-sm runs. To specify the EXCEPT_LIST_FILE parameter, set the following in your Lustre environment: &lt;br /&gt;
&lt;br /&gt;
 EXCEPT_LIST_FILE=/full/path/to/skip/file&lt;br /&gt;
&lt;br /&gt;
The tests-to-skip file can also be specified by having a file named tests-to-skip.sh in the LUSTRE/tests/cfg directory. The EXCEPT_LIST_FILE variable will be used if it is defined. Otherwise, the script looks for LUSTRE/tests/cfg/tests-to-skip.sh and uses this file, if it exists. &lt;br /&gt;
&lt;br /&gt;
If a tests-to-skip file is found, its contents are dumped to stdout before it is read into the t-f environment so the file&#039;s contents are visible in the test results. By following a structured format of commenting skip entries, the tests-to-skip.sh file can serve as a log of test failures and help track bugs associated with those failures (for easy reference).&lt;br /&gt;
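The lookup order described above can be sketched as follows (a simplification, not the actual test-framework code; $LUSTRE stands for the tree's lustre directory):

```shell
# Sketch of the tests-to-skip lookup order (simplified; not the real t-f code)
pick_skip_file() {
    if [ -n "$EXCEPT_LIST_FILE" ] && [ -f "$EXCEPT_LIST_FILE" ]; then
        echo "$EXCEPT_LIST_FILE"                    # explicit setting wins
    elif [ -f "$LUSTRE/tests/cfg/tests-to-skip.sh" ]; then
        echo "$LUSTRE/tests/cfg/tests-to-skip.sh"   # fallback location
    fi
}
skip=$(pick_skip_file)
# the file is dumped to stdout before being sourced, so it shows in the logs
[ -n "$skip" ] && cat "$skip" && . "$skip"
```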
&lt;br /&gt;
This is a sample tests-to-skip file: &lt;br /&gt;
&lt;br /&gt;
 ## SAMPLES for ONLYs &lt;br /&gt;
 #export ACC_SM_ONLY=&amp;quot;METADATA_UPDATES&amp;quot; &lt;br /&gt;
 #export ONLY=&amp;quot;25 26 27 28 29&amp;quot; &lt;br /&gt;
 &lt;br /&gt;
 export SANITY_EXCEPT=&amp;quot;${SANITY_EXCEPT} 71&amp;quot; # requires dbench &lt;br /&gt;
 export SANITY_EXCEPT=&amp;quot;${SANITY_EXCEPT} 117&amp;quot; # bz-21361 crashes on raven, single-node acc-sm &lt;br /&gt;
 export SANITY_EXCEPT=&amp;quot;${SANITY_EXCEPT} 900&amp;quot; # does not seem to work on raven &lt;br /&gt;
 &lt;br /&gt;
 export SANITYN_EXCEPT=&amp;quot;${SANITYN_EXCEPT} 16&amp;quot; # bz-21173 test_16 fails with 120 running fsx &lt;br /&gt;
 &lt;br /&gt;
 export REPLAY_SINGLE_EXCEPT=&amp;quot;${REPLAY_SINGLE_EXCEPT} 70b&amp;quot; # bz-19480 - hitting on raven &lt;br /&gt;
 export OST_POOLS_EXCEPT=&amp;quot;${OST_POOLS_EXCEPT} 23&amp;quot;        # bz-21224 - uses lfs quotacheck which crashes the node &lt;br /&gt;
 &lt;br /&gt;
 # entries may be commented out to test fixes when available like this line below &lt;br /&gt;
 #export REPLAY_DUAL_EXCEPT=&amp;quot;${REPLAY_DUAL_EXCEPT} 14b&amp;quot; # bz-19884 &lt;br /&gt;
 &lt;br /&gt;
 # the lines above turn on/off individual test cases &lt;br /&gt;
 # the lines below turn on/off entire test suites &lt;br /&gt;
 # lines preceded by comments will be run &lt;br /&gt;
 # lines which are not commented and set the name of the test suite to &amp;quot;no&amp;quot; will be skipped. &lt;br /&gt;
 &lt;br /&gt;
 export SLOW=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RUNTESTS=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export SANITY=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export FSX=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export DBENCH=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export BONNIE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export IOZONE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export SANITYN=&amp;quot;no&amp;quot; &lt;br /&gt;
 export LFSCK=&amp;quot;no&amp;quot;               # 1.8.1: bz 19477 &lt;br /&gt;
 # export LIBLUSTRE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RACER=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_SINGLE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export CONF_SANITY=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RECOVERY_SMALL=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_OST_SINGLE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_DUAL=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_VBR=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export INSANITY=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export LARGE_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 export SANITY_QUOTA=&amp;quot;no&amp;quot;        # bz-21224 &lt;br /&gt;
 # export RECOVERY_MDS_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RECOVERY_DOUBLE_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RECOVERY_RANDOM_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export PARALLEL_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export METADATA_UPDATES=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export OST_POOLS=&amp;quot;no&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==What is the CMD configuration for HEAD?==&lt;br /&gt;
&lt;br /&gt;
For the HEAD branch, specify the MDSCOUNT variable (number of MDTs). By default, the variable is set to 1. If you have a Lustre configuration with several MDT nodes, they need to be specified in the configuration file as mds1_HOST, mds2_HOST, ...&lt;br /&gt;
&lt;br /&gt;
By default, all of these variables are set to the mds_HOST value.&lt;br /&gt;
&lt;br /&gt;
==What do we do with the acc-sm test results?==&lt;br /&gt;
&lt;br /&gt;
If an acc-sm test fails, the failure is investigated. If the investigation reveals a Lustre defect, a bug is opened in [https://bugzilla.lustre.org/ Bugzilla] to fix the problem and address the acc-sm failure.&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Acceptance_Small_(acc-sm)_Testing_on_Lustre&amp;diff=9063</id>
		<title>Acceptance Small (acc-sm) Testing on Lustre</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Acceptance_Small_(acc-sm)_Testing_on_Lustre&amp;diff=9063"/>
		<updated>2009-12-15T22:32:21Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* What tests comprise the acc-sm test suite? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
The Lustre™ QE group and developers use acceptance-small (acc-sm) tests to catch bugs early in the development cycle. Within the Lustre group, acc-sm tests are run on YALA, an automated test system. This information is being published to describe the steps to perform acceptance small testing and encourage wider acc-sm testing in the Lustre community.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NOTE&#039;&#039;&#039;: For your convenience, this document is also available as a [http://wiki.lustre.org/images/c/c6/AccSm_Testing.pdf PDF].&lt;br /&gt;
&lt;br /&gt;
==What is acc-sm testing and why do we use it for Lustre?==&lt;br /&gt;
&lt;br /&gt;
Acceptance small (acc-sm) testing is a suite of test cases used to verify different aspects of Lustre functionality.&lt;br /&gt;
&lt;br /&gt;
* These tests are run using the acceptance-small.sh script. &lt;br /&gt;
* The script is run from the lustre/tests directory in a compiled Lustre tree. &lt;br /&gt;
* The acceptance-small.sh script runs a number of test scripts that are also run by the ltest (Buffalo) test harness on Lustre test clusters.&lt;br /&gt;
&lt;br /&gt;
==What tests comprise the acc-sm test suite?==&lt;br /&gt;
&lt;br /&gt;
Each Lustre tree contains a lustre/tests sub-directory; all acc-sm tests are stored here. The acceptance-small.sh file contains a list of all tests in the acc-sm suite. To get the list, run:&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
The acc-sm tests are listed below, by branch.&lt;br /&gt;
&lt;br /&gt;
====b1_6 branch====&lt;br /&gt;
&lt;br /&gt;
This branch includes 17 acc-sm test suites.&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
 export TESTSUITE_LIST=&amp;quot;RUNTESTS SANITY DBENCH BONNIE IOZONE FSX SANITYN LFSCK&lt;br /&gt;
 LIBLUSTRE REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE&lt;br /&gt;
 REPLAY_DUAL INSANITY SANITY_QUOTA PERFORMANCE_SANITY&amp;quot;&lt;br /&gt;
&lt;br /&gt;
====b1_8_gate branch====&lt;br /&gt;
&lt;br /&gt;
This branch includes 18 acc-sm test suites.&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
 export TESTSUITE_LIST=&amp;quot;RUNTESTS SANITY DBENCH BONNIE IOZONE FSX SANITYN LFSCK&lt;br /&gt;
 LIBLUSTRE REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE&lt;br /&gt;
 REPLAY_DUAL REPLAY_VBR INSANITY SANITY_QUOTA PERFORMANCE_SANITY&amp;quot;&lt;br /&gt;
&lt;br /&gt;
====HEAD branch====&lt;br /&gt;
&lt;br /&gt;
This branch includes 19 acc-sm test suites.&lt;br /&gt;
&lt;br /&gt;
 $ grep TESTSUITE_LIST acceptance-small.sh&lt;br /&gt;
 export TESTSUITE_LIST=&amp;quot;RUNTESTS SANITY DBENCH BONNIE IOZONE FSX SANITYN LFSCK&lt;br /&gt;
 LIBLUSTRE REPLAY_SINGLE CONF_SANITY RECOVERY_SMALL REPLAY_OST_SINGLE&lt;br /&gt;
 REPLAY_DUAL INSANITY SANITY_QUOTA SANITY_SEC SANITY_GSS&lt;br /&gt;
 PERFORMANCE_SANITY&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To see the test cases in a particular acc-sm test, run:&lt;br /&gt;
&lt;br /&gt;
 $ grep run_ &amp;lt;test suite script&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For example, to see the last 3 test cases that comprise the SANITY test:&lt;br /&gt;
&lt;br /&gt;
 $ grep run_ sanity.sh | tail -3&lt;br /&gt;
&lt;br /&gt;
 run_test 130c &amp;quot;FIEMAP (2-stripe file with hole)&amp;quot;&lt;br /&gt;
 run_test 130d &amp;quot;FIEMAP (N-stripe file)&amp;quot;&lt;br /&gt;
 run_test 130e &amp;quot;FIEMAP (test continuation FIEMAP calls)&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==What does each acc-sm test measure or show?==&lt;br /&gt;
&lt;br /&gt;
The acc-sm test suites are described below.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;RUNTESTS&#039;&#039;&#039;&lt;br /&gt;
: A basic regression test with unmount/remount.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;SANITY&#039;&#039;&#039;&lt;br /&gt;
: Verifies Lustre operation under normal operating conditions.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;DBENCH&#039;&#039;&#039;&lt;br /&gt;
:Dbench benchmark that simulates N clients to produce filesystem load.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;BONNIE&#039;&#039;&#039;&lt;br /&gt;
:Bonnie++ benchmark for creating, reading, and deleting many small files.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;IOZONE&#039;&#039;&#039;&lt;br /&gt;
:IOzone benchmark for generating and measuring a variety of file operations.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;FSX&#039;&#039;&#039;&lt;br /&gt;
:Filesystem exerciser.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;SANITYN&#039;&#039;&#039;&lt;br /&gt;
:Verifies operations from two clients under normal operating conditions.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;LFSCK&#039;&#039;&#039;&lt;br /&gt;
:Tests e2fsck and lfsck to detect and fix filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;LIBLUSTRE&#039;&#039;&#039;&lt;br /&gt;
:Runs a test linked to a liblustre client library.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;REPLAY_SINGLE&#039;&#039;&#039;&lt;br /&gt;
:Verifies recovery after an MDS failure.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;CONF_SANITY&#039;&#039;&#039;&lt;br /&gt;
:Verifies various Lustre configurations (including wrong ones), where the system must behave correctly.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;RECOVERY_SMALL&#039;&#039;&#039;&lt;br /&gt;
:Verifies RPC replay after a communications failure (message loss).&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;REPLAY_OST_SINGLE&#039;&#039;&#039;&lt;br /&gt;
:Verifies recovery after an OST failure.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;REPLAY_DUAL&#039;&#039;&#039;&lt;br /&gt;
:Verifies recovery from two clients after a server failure.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;INSANITY&#039;&#039;&#039;&lt;br /&gt;
:Tests multiple concurrent failure conditions.&lt;br /&gt;
&lt;br /&gt;
;&#039;&#039;&#039;SANITY_QUOTA&#039;&#039;&#039;&lt;br /&gt;
:Verifies filesystem quotas.&lt;br /&gt;
&lt;br /&gt;
==How do you get the acc-sm tests?==&lt;br /&gt;
&lt;br /&gt;
The acc-sm test suite is stored in the lustre/tests sub-directory on each Git branch (b1_6, b1_8, and HEAD).&lt;br /&gt;
&lt;br /&gt;
==Do you have to run every acc-sm test?==&lt;br /&gt;
&lt;br /&gt;
No. You can choose to run only specified acc-sm tests. Tests can be run either with or without the acceptance-small.sh (acc-sm) wrapper script. Here are several examples:&lt;br /&gt;
&lt;br /&gt;
To only run the RUNTESTS and SANITY tests:&lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=&amp;quot;RUNTESTS SANITY&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
To only run test_1 and test_2 of the SANITYN tests:&lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=&amp;quot;SANITYN&amp;quot; ONLY=&amp;quot;1 2&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
To run only the replay-single.sh test, excluding (not running) the test_3* and test_4* tests:&lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=&amp;quot;REPLAY_SINGLE&amp;quot; REPLAY_SINGLE_EXCEPT=&amp;quot;3 4&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
To only run conf-sanity.sh tests after #15 (without the acceptance-small.sh wrapper script):&lt;br /&gt;
&lt;br /&gt;
 CONF_SANITY_EXCEPT=&amp;quot;$(seq 15)&amp;quot; sh conf-sanity.sh&lt;br /&gt;
&lt;br /&gt;
==Do the acc-sm tests have to be run in a specific order?==&lt;br /&gt;
&lt;br /&gt;
The test order is defined in the acceptance-small.sh script and in each test script. Users do not have to (and should not) do anything to change the order of tests.&lt;br /&gt;
&lt;br /&gt;
==Who runs the acc-sm tests?==&lt;br /&gt;
&lt;br /&gt;
Currently, the QE group and Lustre developers run acc-sm as the main test suite for Lustre testing. Acc-sm tests are run on YALA, the automated test system, with test reports submitted to Buffalo (a web interface for browsing Lustre test results). We welcome external contributions to the Lustre acc-sm testing effort, whether to the Lustre code base or on new testing platforms.&lt;br /&gt;
&lt;br /&gt;
==What type of Lustre environment is needed to run the acc-sm tests? Is anything special needed?==&lt;br /&gt;
&lt;br /&gt;
The default Lustre configuration for acc-sm testing is a single node setup with one MDS and two OSTs. All devices are loop-back devices. YALA, the automated test system, uses a non-default configuration.&lt;br /&gt;
&lt;br /&gt;
To run the acc-sm test suite on a non-default Lustre configuration, you have to modify the default settings in the acc-sm configuration file, lustre/tests/cfg/local.sh. The configuration variables include mds_HOST, ost_HOST, OSTCOUNT, MDS_MOUNT_OPTS and OST_MOUNT_OPTS, among others.&lt;br /&gt;
&lt;br /&gt;
To create your own configuration file, copy cfg/local.sh to cfg/my_config.sh:&lt;br /&gt;
&lt;br /&gt;
 cp cfg/local.sh cfg/my_config.sh&lt;br /&gt;
&lt;br /&gt;
Edit the necessary variables in the configuration file (my_config.sh), then run acc-sm as:&lt;br /&gt;
&lt;br /&gt;
 NAME=my_config sh acceptance-small.sh&lt;br /&gt;
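&lt;br /&gt;
For example, a my_config.sh for a two-node setup might override variables such as the following (the hostnames below are illustrative):&lt;br /&gt;
&lt;br /&gt;
 mds_HOST=mds-node&lt;br /&gt;
 ost_HOST=oss-node&lt;br /&gt;
 OSTCOUNT=2&lt;br /&gt;
 MDS_MOUNT_OPTS=&amp;quot;-o user_xattr&amp;quot;&lt;br /&gt;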
&lt;br /&gt;
==What are the steps to run acc-sm?==&lt;br /&gt;
&lt;br /&gt;
There are two methods to run the acc-sm tests.&lt;br /&gt;
&lt;br /&gt;
1. Check out a Lustre branch (b1_6, b1_8 or HEAD).&lt;br /&gt;
&lt;br /&gt;
2. Change directory to lustre/tests:&lt;br /&gt;
&lt;br /&gt;
 cd lustre/tests&lt;br /&gt;
&lt;br /&gt;
3. Build lustre/tests.&lt;br /&gt;
&lt;br /&gt;
4. Run acc-sm on a local, default Lustre configuration (1 MGS/MDT, 1 OST and 1 client):&lt;br /&gt;
&lt;br /&gt;
 sh acceptance-small.sh 2&amp;gt;&amp;amp;1 | tee /tmp/output&lt;br /&gt;
&lt;br /&gt;
- OR -&lt;br /&gt;
&lt;br /&gt;
1. Install the lustre-tests RPM (available at lts-head:/var/cache/cfs/PACKAGE/rpm/lustre).&lt;br /&gt;
&lt;br /&gt;
2. Change directory to lustre/tests:&lt;br /&gt;
&lt;br /&gt;
 cd /usr/lib/lustre/tests&lt;br /&gt;
&lt;br /&gt;
3. Create your own configuration file and edit it for your configuration.&lt;br /&gt;
&lt;br /&gt;
 cp cfg/local.sh cfg/my_config.sh&lt;br /&gt;
&lt;br /&gt;
4. Run acc-sm on a local Lustre configuration.&lt;br /&gt;
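&lt;br /&gt;
For example, using the configuration file created in step 3:&lt;br /&gt;
&lt;br /&gt;
 NAME=my_config sh acceptance-small.sh 2&amp;gt;&amp;amp;1 | tee /tmp/output&lt;br /&gt;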
&lt;br /&gt;
Here is an example of running acc-sm on a non-default Lustre configuration (MDS is sfire7, OST is sfire8, OSTCOUNT=1, etc.). In this example, only the SANITY test cases are run.&lt;br /&gt;
&lt;br /&gt;
 ACC_SM_ONLY=SANITY mds_HOST=sfire7 ost_HOST=sfire8 MDSDEV1=/dev/sda1 \&lt;br /&gt;
 OSTCOUNT=1 OSTDEV1=/dev/sda1 MDSSIZE=5000000 OSTSIZE=5000000 \&lt;br /&gt;
 MDS_MOUNT_OPTS=&amp;quot;-o user_xattr&amp;quot; OST_MOUNT_OPTS=&amp;quot;-o user_xattr&amp;quot; \&lt;br /&gt;
 REFORMAT=&amp;quot;--reformat&amp;quot; PDSH=&amp;quot;pdsh -S -w&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What if I hit a failure on an acc-sm test?==&lt;br /&gt;
&lt;br /&gt;
* If you regularly hit a failure in any of these tests, check if a bug has been reported on the failure or file a new bug if one has not yet been opened.&lt;br /&gt;
* If the bug prevents you from completing the tests, set the environment variables to skip the specific test(s) until you or someone else fixes them.&lt;br /&gt;
:* For example, to skip sanity.sh subtests 36g and 65, replay-single.sh subtest 42, and all of insanity.sh, set in your environment:&lt;br /&gt;
 &lt;br /&gt;
:&amp;lt;pre&amp;gt;&lt;br /&gt;
export SANITY_EXCEPT=&amp;quot;36g 65&amp;quot;&lt;br /&gt;
export REPLAY_SINGLE_EXCEPT=42&lt;br /&gt;
export INSANITY=no&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:* You can also skip tests on the command line. For example, when running acceptance-small:&lt;br /&gt;
 &lt;br /&gt;
:&amp;lt;pre&amp;gt;&lt;br /&gt;
SANITY_EXCEPT=&amp;quot;36g 65&amp;quot; REPLAY_SINGLE_EXCEPT=42 INSANITY=no ./acceptance-small.sh&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:* The test framework is very flexible, and it is a very easy &amp;quot;hands-off&amp;quot; way of running testing while you are doing other things, like coding.&lt;br /&gt;
:* Questions/problems with the test framework should be emailed to the [http://wiki.lustre.org/index.php/Mailing_Lists lustre-discuss mailing list], so all Lustre users can benefit from improving and documenting it.&lt;br /&gt;
* If you do not run the entire test suite regularly, you will have no way of knowing whether a bug was introduced by your code, and you will waste a lot of time looking.&lt;br /&gt;
&lt;br /&gt;
==How do you run acc-sm on a mounted Lustre system?==&lt;br /&gt;
&lt;br /&gt;
To run acc-sm on a Lustre system that is already mounted, you need to use the correct configuration file (according to the mounted Lustre system) and run acc-sm as: &lt;br /&gt;
&lt;br /&gt;
 SETUP=: CLEANUP=: FORMAT=: NAME=&amp;lt;config&amp;gt; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==How do you run acc-sm with and without reformat?==&lt;br /&gt;
&lt;br /&gt;
By default, the acc-sm test suite does not reformat Lustre. If this is a new system or if you are using new devices and want to reformat Lustre, run acc-sm with REFORMAT=&amp;quot;--reformat&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
 REFORMAT=&amp;quot;--reformat&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
If needed, you can instead regenerate the configuration logs by running acc-sm with WRITECONF=&amp;quot;writeconf&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
 WRITECONF=&amp;quot;writeconf&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==How do you run acc-sm in a Lustre configuration with several clients?==&lt;br /&gt;
&lt;br /&gt;
The default configuration file for acc-sm is cfg/local.sh, which uses only one client (local). To use additional remote clients, specify the RCLIENTS list and use the cfg/ncli.sh configuration file (or your own copy of ncli configuration).&lt;br /&gt;
&lt;br /&gt;
 NAME=ncli RCLIENTS=&amp;lt;space-separated list of remote clients&amp;gt; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
For example:&lt;br /&gt;
&lt;br /&gt;
 NAME=ncli RCLIENTS=&amp;quot;client2 client3 client11&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What is the SLOW variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The SLOW variable is used to run a subset of acc-sm tests. By default, the variable is set to SLOW=no, which causes some of the longer acc-sm tests to be skipped so that the acc-sm test run completes in less than 2 hours. To run all of the acc-sm tests, set the variable to SLOW=yes:&lt;br /&gt;
&lt;br /&gt;
 SLOW=yes sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What is the FAIL_ON_ERROR variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The FAIL_ON_ERROR variable is used to &amp;quot;stop&amp;quot; or &amp;quot;continue&amp;quot; running acc-sm tests after a test failure occurs. If the variable is set to &amp;quot;true&amp;quot; (FAIL_ON_ERROR=true), then acc-sm stops after test_N fails and test_N+1 does not run. If the variable is set to &amp;quot;false&amp;quot; (FAIL_ON_ERROR=false), then acc-sm continues after test_N fails and test_N+1 does run.&lt;br /&gt;
&lt;br /&gt;
By default, FAIL_ON_ERROR=false for the sanity, sanityn and sanity-quota tests, and FAIL_ON_ERROR=true for the replay/recovery tests.&lt;br /&gt;
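&lt;br /&gt;
For example, to override the default and stop on the first failure in every suite:&lt;br /&gt;
&lt;br /&gt;
 FAIL_ON_ERROR=true sh acceptance-small.sh&lt;br /&gt;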
&lt;br /&gt;
==What is the PDSH variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The PDSH variable is used to provide remote shell access. If acc-sm is run on a Lustre configuration with remote servers, specify PDSH like this:&lt;br /&gt;
&lt;br /&gt;
 PDSH=&amp;quot;pdsh -S -w&amp;quot; sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
If the client has no access to the servers, you can run acc-sm without PDSH, but the tests which need PDSH access are skipped. A summary report is generated which lists the skipped tests.&lt;br /&gt;
&lt;br /&gt;
==What is the LOAD_MODULES_REMOTE variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
The LOAD_MODULES_REMOTE variable is used to load/unload modules on remote nodes. By default, the variable is set to LOAD_MODULES_REMOTE=false, and modules are not loaded or unloaded on remote nodes during acceptance small testing. &lt;br /&gt;
&lt;br /&gt;
To load/unload modules on remote nodes, set the variable to LOAD_MODULES_REMOTE=true when running the acc-sm tests:&lt;br /&gt;
&lt;br /&gt;
 LOAD_MODULES_REMOTE=true sh acceptance-small.sh&lt;br /&gt;
&lt;br /&gt;
==What is the EXCEPT_LIST_FILE variable and how is it used with acc-sm?==&lt;br /&gt;
&lt;br /&gt;
In Lustre 1.8.2 and later, the EXCEPT_LIST_FILE variable can be used to specify the tests-to-skip file, which tracks the tests to skip during acc-sm runs. To specify the EXCEPT_LIST_FILE parameter, set the following in your Lustre environment: &lt;br /&gt;
&lt;br /&gt;
 EXCEPT_LIST_FILE=/full/path/to/skip/file&lt;br /&gt;
&lt;br /&gt;
The tests-to-skip file can also be specified by having a file named tests-to-skip.sh in the LUSTRE/tests/cfg directory. The EXCEPT_LIST_FILE variable will be used if it is defined. Otherwise, the script looks for LUSTRE/tests/cfg/tests-to-skip.sh and uses this file, if it exists. &lt;br /&gt;
&lt;br /&gt;
If a tests-to-skip file is found, its contents are dumped to stdout before it is read into the t-f environment, so the file&#039;s contents are visible in the test results. By following a structured format for commenting skip entries, the tests-to-skip.sh file can serve as a log of test failures and help track the bugs associated with those failures (for easy reference).&lt;br /&gt;
&lt;br /&gt;
This is a sample tests-to-skip file: &lt;br /&gt;
&lt;br /&gt;
 ## SAMPLES for ONLYs &lt;br /&gt;
 #export ACC_SM_ONLY=&amp;quot;METADATA_UPDATES&amp;quot; &lt;br /&gt;
 #export ONLY=&amp;quot;25 26 27 28 29&amp;quot; &lt;br /&gt;
 &lt;br /&gt;
 export SANITY_EXCEPT=&amp;quot;${SANITY_EXCEPT} 71&amp;quot; # requires dbench &lt;br /&gt;
 export SANITY_EXCEPT=&amp;quot;${SANITY_EXCEPT} 117&amp;quot; # bz-21361 crashes on raven, single-node acc-sm &lt;br /&gt;
 export SANITY_EXCEPT=&amp;quot;${SANITY_EXCEPT} 900&amp;quot; # does not seem to work on raven &lt;br /&gt;
 &lt;br /&gt;
 export SANITYN_EXCEPT=&amp;quot;${SANITYN_EXCEPT} 16&amp;quot; # bz-21173 test_16 fails with 120 running fsx &lt;br /&gt;
 &lt;br /&gt;
 export REPLAY_SINGLE_EXCEPT=&amp;quot;${REPLAY_SINGLE_EXCEPT} 70b&amp;quot; # bz-19480 - hitting on raven &lt;br /&gt;
 export OST_POOLS_EXCEPT=&amp;quot;${OST_POOLS_EXCEPT} 23&amp;quot;        # bz-21224 - uses lfs quotacheck which crashes the node &lt;br /&gt;
 &lt;br /&gt;
 # entries may be commented out to test fixes when available like this line below &lt;br /&gt;
 #export REPLAY_DUAL_EXCEPT=&amp;quot;${REPLAY_DUAL_EXCEPT} 14b&amp;quot; # bz-19884 &lt;br /&gt;
 &lt;br /&gt;
 # the lines above turn on/off individual test cases &lt;br /&gt;
 # the lines below turn on/off entire test suites &lt;br /&gt;
 # lines preceded by comments will be run &lt;br /&gt;
 # lines which are not commented and set the name of the test suite to &amp;quot;no&amp;quot; will be skipped. &lt;br /&gt;
 &lt;br /&gt;
 export SLOW=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RUNTESTS=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export SANITY=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export FSX=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export DBENCH=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export BONNIE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export IOZONE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export SANITYN=&amp;quot;no&amp;quot; &lt;br /&gt;
 export LFSCK=&amp;quot;no&amp;quot;               # 1.8.1: bz 19477 &lt;br /&gt;
 # export LIBLUSTRE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RACER=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_SINGLE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export CONF_SANITY=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RECOVERY_SMALL=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_OST_SINGLE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_DUAL=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export REPLAY_VBR=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export INSANITY=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export LARGE_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 export SANITY_QUOTA=&amp;quot;no&amp;quot;        # bz-21224 &lt;br /&gt;
 # export RECOVERY_MDS_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RECOVERY_DOUBLE_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export RECOVERY_RANDOM_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export PARALLEL_SCALE=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export METADATA_UPDATES=&amp;quot;no&amp;quot; &lt;br /&gt;
 # export OST_POOLS=&amp;quot;no&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==What is the CMD configuration for HEAD?==&lt;br /&gt;
&lt;br /&gt;
For the HEAD branch, specify the MDSCOUNT variable (number of MDTs). By default, the variable is set to 1. If you have a Lustre configuration with several MDT nodes, they need to be specified in the configuration file as mds1_HOST, mds2_HOST, ...&lt;br /&gt;
&lt;br /&gt;
By default, all of these variables are set to the mds_HOST value.&lt;br /&gt;
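&lt;br /&gt;
For example, a configuration file for two MDTs might set (the hostnames below are illustrative):&lt;br /&gt;
&lt;br /&gt;
 MDSCOUNT=2&lt;br /&gt;
 mds1_HOST=mds-node1&lt;br /&gt;
 mds2_HOST=mds-node2&lt;br /&gt;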
&lt;br /&gt;
==What do we do with the acc-sm test results?==&lt;br /&gt;
&lt;br /&gt;
If an acc-sm test fails, the failure is investigated. If the investigation reveals there is a Lustre defect, a bug is opened in [https://bugzilla.lustre.org/ Bugzilla] to fix the problem and also the acc-sm issue.&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Documenting_Code&amp;diff=9062</id>
		<title>Documenting Code</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Documenting_Code&amp;diff=9062"/>
		<updated>2009-12-15T22:30:38Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* Publishing Documention */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Lustre code documentation helps engineers working on the code to read and correctly modify the code.  The reader is expected to have a good overall grasp of the [[Lustre_Internals|Lustre architecture and internals]]. The code documentation provides reference information on the application programming interfaces (APIs) and describes significant internal features of each [[Subsystem Map|Lustre subsystem]]. &lt;br /&gt;
&lt;br /&gt;
Lustre code documentation consists of stylized comments embedded in the source code, which helps to keep the documentation consistent as the code is developed. The embedded comments can be processed by [http://www.doxygen.org doxygen] into online, browse-able (HTML) documentation.&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
The minimum requirement for documenting Lustre code is to describe subsystem APIs - the datatypes, procedures, and globals a subsystem exports to the rest of Lustre - and significant internal datatypes.  These should be described as follows:&lt;br /&gt;
&lt;br /&gt;
* Datatypes (structs, typedefs, enums)&lt;br /&gt;
** What it is for&lt;br /&gt;
** Structure members&lt;br /&gt;
** Usage constraints&lt;br /&gt;
* Procedures&lt;br /&gt;
** What it does&lt;br /&gt;
** Parameters&lt;br /&gt;
** Return values&lt;br /&gt;
** Usage constraints&lt;br /&gt;
** Subtle implementation details&lt;br /&gt;
* Globals&lt;br /&gt;
** What it is for&lt;br /&gt;
** Usage constraints&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;most important&#039;&#039; information to include is &amp;quot;Usage constraints&amp;quot; and &amp;quot;Subtle implementation details&amp;quot;.  &lt;br /&gt;
&lt;br /&gt;
&amp;quot;Usage constraints&amp;quot; are restrictions on how and when you call a procedure or operate on a data structure.  These include concurrency control, reference counting, permitted caller context, and so on.  &lt;br /&gt;
&lt;br /&gt;
&amp;quot;Subtle implementation details&amp;quot; are anything done in the code that might not be transparently obvious, such as code that ensures the last thread in a pool of workers is held in reserve for deadlock avoidance.  &lt;br /&gt;
&lt;br /&gt;
A well-chosen descriptive name can allow other information, such as what the procedure does or what a parameter means, to be quite brief or even omitted.  But usage constraints and implementation subtleties must always be spelled out, e.g. by describing an object&#039;s entire lifecycle from creation through to destruction, so that the next engineer to maintain or use the code does it safely and correctly.  &lt;br /&gt;
&lt;br /&gt;
Each time you make a change to the Lustre code or inspect a patch, you must review the changes to ensure:&lt;br /&gt;
&lt;br /&gt;
* Sufficient documentation exists.&lt;br /&gt;
* The documentation is accurate and up to date.&lt;br /&gt;
&lt;br /&gt;
== Examples ==&lt;br /&gt;
&lt;br /&gt;
Doxygen comments start with [http://www.doxygen.org/docblocks.html &#039;&#039;/**&#039;&#039;] (like in [http://en.wikipedia.org/wiki/Javadoc javadoc]).  &lt;br /&gt;
&lt;br /&gt;
Doxygen commands are placed in doxygen comments to control how doxygen formats the output.  Commands start with a backslash (&#039;&#039;\&#039;&#039;) or at-sign (&#039;&#039;@&#039;&#039;), but we typically use the backslash and reserve the at-sign for group blocks (see below).  Don&#039;t use doxygen commands unnecessarily.  &lt;br /&gt;
&lt;br /&gt;
The main purpose of code documentation is to be available in the code for you to read when you&#039;re working on the code. So it&#039;s important that the comments read like real C comments and not formatting gibberish.&lt;br /&gt;
&lt;br /&gt;
===Procedures and Globals===&lt;br /&gt;
Document procedures and globals in the &#039;&#039;.c&#039;&#039; files, rather than in headers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * Owns a page by IO.&lt;br /&gt;
 *&lt;br /&gt;
 * Waits until \a pg is in cl_page_state::CPS_CACHED state, and then switch it&lt;br /&gt;
 * into cl_page_state::CPS_OWNED state.&lt;br /&gt;
 *&lt;br /&gt;
 * \param io IO context which wants to own the page&lt;br /&gt;
 * \param pg page to be owned&lt;br /&gt;
 *&lt;br /&gt;
 * \pre  !cl_page_is_owned(pg, io)&lt;br /&gt;
 * \post result == 0 iff cl_page_is_owned(pg, io)&lt;br /&gt;
 *&lt;br /&gt;
 * \retval 0   success&lt;br /&gt;
 *&lt;br /&gt;
 * \retval -ve failure, e.g., page was destroyed (and landed in&lt;br /&gt;
 *             cl_page_state::CPS_FREEING instead of cl_page_state::CPS_CACHED).&lt;br /&gt;
 *&lt;br /&gt;
 * \see cl_page_disown()&lt;br /&gt;
 * \see cl_page_operations::cpo_own()&lt;br /&gt;
 */&lt;br /&gt;
int cl_page_own(const struct lu_env *env, struct cl_io *io, struct cl_page *pg)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Notes:&#039;&#039;&lt;br /&gt;
* Start with a brief description, which continues to the first &#039;.&#039; (period or full stop).&lt;br /&gt;
* Follow the brief description with a detailed description.&lt;br /&gt;
* Descriptions are written in the third person singular,  e.g. &amp;quot;&amp;lt;this function&amp;gt; does this and that&amp;quot;, &amp;quot;&amp;lt;this datatype&amp;gt; represents such and such a concept&amp;quot;.&lt;br /&gt;
* To refer to a function argument, use the [http://www.doxygen.org/commands.html#cmda &#039;&#039;\a argname&#039;&#039;] syntax.&lt;br /&gt;
* To refer to another function, use the [http://www.doxygen.org/autolink.html &#039;&#039;funcname()&#039;&#039;] syntax.  This will produce a cross-reference.&lt;br /&gt;
* To refer to a field or an enum value use the [http://www.doxygen.org/autolink.html &#039;&#039;SCOPE::NAME&#039;&#039;] syntax.&lt;br /&gt;
* Describe possible return values with [http://www.doxygen.org/commands.html#cmdretval &#039;&#039;\retval&#039;&#039;].&lt;br /&gt;
* Mention all concurrency control restrictions here (such as locks that the function expects to be held, or holds on exit).&lt;br /&gt;
* If possible, specify a (weakest) pre-condition and (strongest) post-condition for the function. If conditions cannot be expressed as a C language expression, provide an informal description.&lt;br /&gt;
* Enumerate related functions and datatypes in the [http://www.doxygen.org/commands.html#cmdsee &#039;&#039;\see&#039;&#039;] section. Note, that doxygen will automatically cross-reference all places where a given function is called (but not through a function pointer) and all functions that it calls, so there is no need to enumerate all this.&lt;br /&gt;
&lt;br /&gt;
===Datatypes===&lt;br /&gt;
Document datatypes where they are declared.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * &amp;quot;Compound&amp;quot; object, consisting of multiple layers.&lt;br /&gt;
 *&lt;br /&gt;
 * Compound object with given fid is unique with given lu_site.&lt;br /&gt;
 *&lt;br /&gt;
 * Note, that object does *not* necessary correspond to the real object in the&lt;br /&gt;
 * persistent storage: object is an anchor for locking and method calling, so&lt;br /&gt;
 * it is created for things like not-yet-existing child created by mkdir or&lt;br /&gt;
 * create calls. lu_object_operations::loo_exists() can be used to check&lt;br /&gt;
 * whether object is backed by persistent storage entity.&lt;br /&gt;
 */&lt;br /&gt;
struct lu_object_header {&lt;br /&gt;
        /**&lt;br /&gt;
         * Object flags from enum lu_object_header_flags. Set and checked&lt;br /&gt;
         * atomically.&lt;br /&gt;
         */&lt;br /&gt;
        unsigned long     loh_flags;&lt;br /&gt;
        /**&lt;br /&gt;
         * Object reference count. Protected by lu_site::ls_guard.&lt;br /&gt;
         */&lt;br /&gt;
        atomic_t          loh_ref;&lt;br /&gt;
        /**&lt;br /&gt;
         * Fid, uniquely identifying this object.&lt;br /&gt;
         */&lt;br /&gt;
        struct lu_fid     loh_fid;&lt;br /&gt;
        /**&lt;br /&gt;
         * Common object attributes, cached for efficiency. From enum&lt;br /&gt;
         * lu_object_header_attr.&lt;br /&gt;
         */&lt;br /&gt;
        __u32             loh_attr;&lt;br /&gt;
        /**&lt;br /&gt;
         * Linkage into per-site hash table. Protected by lu_site::ls_guard.&lt;br /&gt;
         */&lt;br /&gt;
        struct hlist_node loh_hash;&lt;br /&gt;
        /**&lt;br /&gt;
         * Linkage into per-site LRU list. Protected by lu_site::ls_guard.&lt;br /&gt;
         */&lt;br /&gt;
        struct list_head  loh_lru;&lt;br /&gt;
        /**&lt;br /&gt;
         * Linkage into list of layers. Never modified once set (except lately&lt;br /&gt;
         * during object destruction). No locking is necessary.&lt;br /&gt;
         */&lt;br /&gt;
        struct list_head  loh_layers;&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Describe datatype invariants (preferably formally).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * Fields are protected by the lock on cfs_page_t, except for atomics and&lt;br /&gt;
 * immutables.&lt;br /&gt;
 *&lt;br /&gt;
 * \invariant Datatype invariants are in cl_page_invariant(). Basically:&lt;br /&gt;
 * cl_page::cp_parent and cl_page::cp_child are a well-formed double-linked&lt;br /&gt;
 * list, consistent with the parent/child pointers in the cl_page::cp_obj and&lt;br /&gt;
 * cl_page::cp_owner (when set).&lt;br /&gt;
 */&lt;br /&gt;
struct cl_page {&lt;br /&gt;
        /** Reference counter. */&lt;br /&gt;
        atomic_t           cp_ref;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Describe concurrency control mechanisms for structure fields.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
        /** An object this page is a part of. Immutable after creation. */&lt;br /&gt;
        struct cl_object  *cp_obj;&lt;br /&gt;
        /** Logical page index within the object. Immutable after creation. */&lt;br /&gt;
        pgoff_t            cp_index;&lt;br /&gt;
        /** List of slices. Immutable after creation. */&lt;br /&gt;
        struct list_head   cp_layers;&lt;br /&gt;
        ...&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Specify when fields are valid.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
        /**&lt;br /&gt;
         * Owning IO in cl_page_state::CPS_OWNED state. Sub-page can be owned&lt;br /&gt;
         * by sub-io.&lt;br /&gt;
         */&lt;br /&gt;
        struct cl_io      *cp_owner;&lt;br /&gt;
        /**&lt;br /&gt;
         * Owning IO request in cl_page_state::CPS_PAGEOUT and&lt;br /&gt;
         * cl_page_state::CPS_PAGEIN states. This field is maintained only in&lt;br /&gt;
         * the top-level pages.&lt;br /&gt;
         */&lt;br /&gt;
        struct cl_req     *cp_req;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can use [http://www.doxygen.org/grouping.html#memgroup &#039;&#039;@{&#039;&#039;&#039;...&#039;&#039;&#039;@}&#039;&#039;] syntax to define a subset of fields or &#039;&#039;enum&#039;&#039; values, which should be grouped together.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
struct cl_object_header {&lt;br /&gt;
        /** Standard lu_object_header. cl_object::co_lu::lo_header points&lt;br /&gt;
         * here. */&lt;br /&gt;
        struct lu_object_header  coh_lu;&lt;br /&gt;
        /** \name locks&lt;br /&gt;
         * \todo XXX move locks below to the separate cache-lines, they are&lt;br /&gt;
         * mostly useless otherwise.&lt;br /&gt;
         */&lt;br /&gt;
        /** @{ */&lt;br /&gt;
        /** Lock protecting page tree. */&lt;br /&gt;
        spinlock_t               coh_page_guard;&lt;br /&gt;
        /** Lock protecting lock list. */&lt;br /&gt;
        spinlock_t               coh_lock_guard;&lt;br /&gt;
        /** @} locks */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By default, a documenting comment goes immediately before the entity being commented.  If it is necessary to place this comment separately (e.g., to streamline comments in the header file), use the following syntax.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/** \struct cl_page&lt;br /&gt;
 * Layered client page.&lt;br /&gt;
 *&lt;br /&gt;
 * cl_page: represents a portion of a file, cached in the memory. All pages&lt;br /&gt;
 *    of the given file are of the same size, and are kept in the radix tree&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subsystem Overview===&lt;br /&gt;
&lt;br /&gt;
To document a subsystem, add the following comment to the header file that contains the definitions of its key datatypes.  This will group all the documentation in the &#039;&#039;@{&#039;&#039;...&#039;&#039;@}&#039;&#039; block.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/** \defgroup component_name Component Name&lt;br /&gt;
 *&lt;br /&gt;
 * overall module documentation&lt;br /&gt;
 * ...&lt;br /&gt;
 *&lt;br /&gt;
 * @{ &lt;br /&gt;
 */&lt;br /&gt;
datatype definitions...&lt;br /&gt;
exported functions...&lt;br /&gt;
/** @} component_name */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The single-word name &#039;&#039;component_name&#039;&#039; identifies a group to doxygen.  &#039;&#039;Component Name&#039;&#039; is the printable title of the group.  It extends to the end of the line.  See [http://doxygen.org/commands.html#cmddefgroup \defgroup] for more details.&lt;br /&gt;
&lt;br /&gt;
To separate a logical part of a larger component, add the following somewhere within the &#039;&#039;\defgroup&#039;&#039; of the component:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * \name Printable Title of sub-component&lt;br /&gt;
 *&lt;br /&gt;
 * Description of a sub-component&lt;br /&gt;
 */&lt;br /&gt;
/** @{ */&lt;br /&gt;
datatype definitions...&lt;br /&gt;
exported functions...&lt;br /&gt;
/** @} */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If an exported function prototype in a header is located within some group, the appropriate function definition in a &#039;&#039;.c&#039;&#039; file is automatically assigned to the same group.&lt;br /&gt;
&lt;br /&gt;
A set of comments that is not lexically a part of a group can be included into it with the &#039;&#039;\addtogroup&#039;&#039; command.  It works just like &#039;&#039;\defgroup&#039;&#039;, but the printable group title is optional.  See [http://doxygen.org/commands.html#cmdaddtogroup \addtogroup] for full details.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/** \addtogroup cl_object&lt;br /&gt;
 * @{ */&lt;br /&gt;
/**&lt;br /&gt;
 * &amp;quot;Data attributes&amp;quot; of cl_object. Data attributes can be updated&lt;br /&gt;
 * independently for a sub-object, and top-object&#039;s attributes are calculated&lt;br /&gt;
 * from sub-objects&#039; ones.&lt;br /&gt;
 */&lt;br /&gt;
struct cl_attr {&lt;br /&gt;
        /** Object size, in bytes */&lt;br /&gt;
        loff_t cat_size;&lt;br /&gt;
        ...&lt;br /&gt;
};&lt;br /&gt;
...&lt;br /&gt;
/** @} cl_object */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Running Doxygen ==&lt;br /&gt;
You need to install the Graphviz package before you can run doxygen.&lt;br /&gt;
&lt;br /&gt;
Doxygen uses a &#039;&#039;configuration file&#039;&#039; to control how it builds documentation. See [http://www.doxygen.org/config.html Doxygen Configuration] for details.&lt;br /&gt;
&lt;br /&gt;
Lustre comes with two configuration files:&lt;br /&gt;
* &#039;&#039;build/doxyfile.ref&#039;&#039; produces a &#039;&#039;short&#039;&#039; form of the documentation set, suitable as a reference. Output is placed into the &#039;&#039;doxygen.ref/&#039;&#039; directory.&lt;br /&gt;
* &#039;&#039;build/doxyfile.api&#039;&#039; produces a full documentation set, more suitable for learning code structure. In addition to the short form, this set includes call-graphs and source code excerpts. Output is placed into the &#039;&#039;doxygen.api/&#039;&#039; directory.&lt;br /&gt;
&lt;br /&gt;
If the version of doxygen you are running is newer than the one last used to generate the configuration files, run the following commands to upgrade:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
doxygen -s -u build/doxyfile.api&lt;br /&gt;
doxygen -s -u build/doxyfile.ref&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To build all the documentation, in the top-level lustre directory, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
doxygen build/doxyfile.api&lt;br /&gt;
doxygen build/doxyfile.ref&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are also phony Makefile targets &#039;&#039;doxygen-api&#039;&#039; and &#039;&#039;doxygen-ref&#039;&#039; to run these commands and &#039;&#039;doxygen&#039;&#039; to run both.&lt;br /&gt;
&lt;br /&gt;
Note that doxygen currently gives many warnings about undocumented entities.  These should abate as we improve the code documentation.&lt;br /&gt;
&lt;br /&gt;
== Publishing Documentation ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;build/publish_doxygen&#039;&#039; script publishes a local version of the documentation at &amp;quot;http://wiki.lustre.org/doxygen&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
build/publish_doxygen [-b branchname] [-l additional-label] [-d] [-u user] [-p port]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The default branch is &amp;quot;master&amp;quot;.  The user and port are used to &#039;&#039;ssh&#039;&#039; into &#039;&#039;shell.lustre.sun.com&#039;&#039;.  &#039;&#039;User&#039;&#039; defaults to your &#039;&#039;$USER&#039;&#039; environment variable and &#039;&#039;port&#039;&#039; defaults to 922. The &#039;&#039;-d&#039;&#039; option instructs the script to use the current date as a label. &lt;br /&gt;
&lt;br /&gt;
Documentation is uploaded into...&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
user@shell.lustre.sun.com:/home/www/doxygen/$branch$label&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
where &#039;&#039;$label&#039;&#039; is a concatenation of all labels given on the command line in order.  The parent directory is &#039;&#039;rsync&#039;&#039;-ed to wiki.lustre.org regularly and the documentation can be browsed at...&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
http://wiki.lustre.org/doxygen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
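As an illustrative sketch, the destination path is simply the branch name with the labels appended. The exact separator and label format below are assumptions; the real logic lives in &#039;&#039;build/publish_doxygen&#039;&#039;:&lt;br /&gt;

```shell
# Illustrative sketch only: how the upload destination is composed from the
# branch name and the concatenated labels. The values below are hypothetical.
branch="master"                  # from -b (the default, per the text above)
label="-20091214"                # concatenation of -l/-d labels, in order
dest="/home/www/doxygen/${branch}${label}"
echo "$dest"                     # -> /home/www/doxygen/master-20091214
```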
&lt;br /&gt;
When adding a new branch/label, you have to edit &#039;&#039;index.html&#039;&#039; in the doxygen directory on shell.lustre.sun.com.&lt;br /&gt;
&lt;br /&gt;
== Doxygen References ==&lt;br /&gt;
&lt;br /&gt;
[http://www.doxygen.org/ Doxygen Home]&lt;br /&gt;
&lt;br /&gt;
[http://www.doxygen.org/manual.html/ Doxygen Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.doxygen.org/commands.html/ Doxygen Special Commands]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Documenting_Code&amp;diff=9061</id>
		<title>Documenting Code</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Documenting_Code&amp;diff=9061"/>
		<updated>2009-12-15T22:29:31Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* Publishing Documention */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Lustre code documentation helps engineers working on the code to read and correctly modify the code.  The reader is expected to have a good overall grasp of the [[Lustre_Internals|Lustre architecture and internals]]. The code documentation provides reference information on the application programming interfaces (APIs) and describes significant internal features of each [[Subsystem Map|Lustre subsystem]]. &lt;br /&gt;
&lt;br /&gt;
Lustre code documentation consists of stylized comments embedded in the source code, which helps to keep the documentation consistent as the code is developed. The embedded comments can be processed by [http://www.doxygen.org doxygen] into online, browsable (HTML) documentation.&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
The minimum requirement for documenting Lustre code is to describe subsystem APIs - the datatypes, procedures, and globals a subsystem exports to the rest of Lustre - and significant internal datatypes.  These should be described as follows:&lt;br /&gt;
&lt;br /&gt;
* Datatypes (structs, typedefs, enums)&lt;br /&gt;
** What it is for&lt;br /&gt;
** Structure members&lt;br /&gt;
** Usage constraints&lt;br /&gt;
* Procedures&lt;br /&gt;
** What it does&lt;br /&gt;
** Parameters&lt;br /&gt;
** Return values&lt;br /&gt;
** Usage constraints&lt;br /&gt;
** Subtle implementation details&lt;br /&gt;
* Globals&lt;br /&gt;
** What it is for&lt;br /&gt;
** Usage constraints&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;most important&#039;&#039; information to include is &amp;quot;Usage constraints&amp;quot; and &amp;quot;Subtle implementation details&amp;quot;.  &lt;br /&gt;
&lt;br /&gt;
&amp;quot;Usage constraints&amp;quot; are restrictions on how and when you may call a procedure or operate on a data structure.  These include concurrency control, reference counting, permitted caller context, and so on.  &lt;br /&gt;
&lt;br /&gt;
&amp;quot;Subtle implementation details&amp;quot; are anything done in the code that might not be transparently obvious, such as code that ensures the last thread in a pool of workers is held in reserve for deadlock avoidance.  &lt;br /&gt;
&lt;br /&gt;
A well-chosen descriptive name can allow other information, such as what the procedure does or what a parameter means, to be quite brief or even omitted.  But usage constraints and implementation subtleties must always be spelled out, e.g. by describing an object&#039;s entire lifecycle from creation through to destruction, so that the next engineer to maintain or use the code does it safely and correctly.  &lt;br /&gt;
&lt;br /&gt;
Each time you make a change to the Lustre code or inspect a patch, you must review the changes to ensure:&lt;br /&gt;
&lt;br /&gt;
* Sufficient documentation exists.&lt;br /&gt;
* The documentation is accurate and up to date.&lt;br /&gt;
&lt;br /&gt;
== Examples ==&lt;br /&gt;
&lt;br /&gt;
Doxygen comments start with [http://www.doxygen.org/docblocks.html &#039;&#039;/**&#039;&#039;] (like in [http://en.wikipedia.org/wiki/Javadoc javadoc]).  &lt;br /&gt;
&lt;br /&gt;
Doxygen commands are placed in doxygen comments to control how doxygen formats the output.  Commands start with a backslash (&#039;&#039;\&#039;&#039;) or at-sign (&#039;&#039;@&#039;&#039;), but we typically use the backslash and reserve the at-sign for group blocks (see below).  Don&#039;t use doxygen commands unnecessarily.  &lt;br /&gt;
&lt;br /&gt;
The main purpose of code documentation is to be available in the source for you to read while you work on the code, so it&#039;s important that the comments read like real C comments rather than formatting gibberish.&lt;br /&gt;
&lt;br /&gt;
===Procedures and Globals===&lt;br /&gt;
Document procedures and globals in the &#039;&#039;.c&#039;&#039; files, rather than in headers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * Owns a page by IO.&lt;br /&gt;
 *&lt;br /&gt;
 * Waits until \a pg is in cl_page_state::CPS_CACHED state, and then switch it&lt;br /&gt;
 * into cl_page_state::CPS_OWNED state.&lt;br /&gt;
 *&lt;br /&gt;
 * \param io IO context which wants to own the page&lt;br /&gt;
 * \param pg page to be owned&lt;br /&gt;
 *&lt;br /&gt;
 * \pre  !cl_page_is_owned(pg, io)&lt;br /&gt;
 * \post result == 0 iff cl_page_is_owned(pg, io)&lt;br /&gt;
 *&lt;br /&gt;
 * \retval 0   success&lt;br /&gt;
 *&lt;br /&gt;
 * \retval -ve failure, e.g., page was destroyed (and landed in&lt;br /&gt;
 *             cl_page_state::CPS_FREEING instead of cl_page_state::CPS_CACHED).&lt;br /&gt;
 *&lt;br /&gt;
 * \see cl_page_disown()&lt;br /&gt;
 * \see cl_page_operations::cpo_own()&lt;br /&gt;
 */&lt;br /&gt;
int cl_page_own(const struct lu_env *env, struct cl_io *io, struct cl_page *pg)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Notes:&#039;&#039;&lt;br /&gt;
* Start with a brief description, which continues to the first &#039;.&#039; (period or full stop).&lt;br /&gt;
* Follow the brief description with a detailed description.&lt;br /&gt;
* Descriptions are written in the third person singular,  e.g. &amp;quot;&amp;lt;this function&amp;gt; does this and that&amp;quot;, &amp;quot;&amp;lt;this datatype&amp;gt; represents such and such a concept&amp;quot;.&lt;br /&gt;
* To refer to a function argument, use the [http://www.doxygen.org/commands.html#cmda &#039;&#039;\a argname&#039;&#039;] syntax.&lt;br /&gt;
* To refer to another function, use the [http://www.doxygen.org/autolink.html &#039;&#039;funcname()&#039;&#039;] syntax.  This will produce a cross-reference.&lt;br /&gt;
* To refer to a field or an enum value use the [http://www.doxygen.org/autolink.html &#039;&#039;SCOPE::NAME&#039;&#039;] syntax.&lt;br /&gt;
* Describe possible return values with [http://www.doxygen.org/commands.html#cmdretval &#039;&#039;\retval&#039;&#039;].&lt;br /&gt;
* Mention all concurrency control restrictions here (such as locks that the function expects to be held, or holds on exit).&lt;br /&gt;
* If possible, specify a (weakest) pre-condition and (strongest) post-condition for the function. If conditions cannot be expressed as a C language expression, provide an informal description.&lt;br /&gt;
* Enumerate related functions and datatypes in the [http://www.doxygen.org/commands.html#cmdsee &#039;&#039;\see&#039;&#039;] section. Note, that doxygen will automatically cross-reference all places where a given function is called (but not through a function pointer) and all functions that it calls, so there is no need to enumerate all this.&lt;br /&gt;
&lt;br /&gt;
===Datatypes===&lt;br /&gt;
Document datatypes where they are declared.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * &amp;quot;Compound&amp;quot; object, consisting of multiple layers.&lt;br /&gt;
 *&lt;br /&gt;
 * Compound object with given fid is unique with given lu_site.&lt;br /&gt;
 *&lt;br /&gt;
 * Note, that object does *not* necessary correspond to the real object in the&lt;br /&gt;
 * persistent storage: object is an anchor for locking and method calling, so&lt;br /&gt;
 * it is created for things like not-yet-existing child created by mkdir or&lt;br /&gt;
 * create calls. lu_object_operations::loo_exists() can be used to check&lt;br /&gt;
 * whether object is backed by persistent storage entity.&lt;br /&gt;
 */&lt;br /&gt;
struct lu_object_header {&lt;br /&gt;
        /**&lt;br /&gt;
         * Object flags from enum lu_object_header_flags. Set and checked&lt;br /&gt;
         * atomically.&lt;br /&gt;
         */&lt;br /&gt;
        unsigned long     loh_flags;&lt;br /&gt;
        /**&lt;br /&gt;
         * Object reference count. Protected by lu_site::ls_guard.&lt;br /&gt;
         */&lt;br /&gt;
        atomic_t          loh_ref;&lt;br /&gt;
        /**&lt;br /&gt;
         * Fid, uniquely identifying this object.&lt;br /&gt;
         */&lt;br /&gt;
        struct lu_fid     loh_fid;&lt;br /&gt;
        /**&lt;br /&gt;
         * Common object attributes, cached for efficiency. From enum&lt;br /&gt;
         * lu_object_header_attr.&lt;br /&gt;
         */&lt;br /&gt;
        __u32             loh_attr;&lt;br /&gt;
        /**&lt;br /&gt;
         * Linkage into per-site hash table. Protected by lu_site::ls_guard.&lt;br /&gt;
         */&lt;br /&gt;
        struct hlist_node loh_hash;&lt;br /&gt;
        /**&lt;br /&gt;
         * Linkage into per-site LRU list. Protected by lu_site::ls_guard.&lt;br /&gt;
         */&lt;br /&gt;
        struct list_head  loh_lru;&lt;br /&gt;
        /**&lt;br /&gt;
         * Linkage into list of layers. Never modified once set (except lately&lt;br /&gt;
         * during object destruction). No locking is necessary.&lt;br /&gt;
         */&lt;br /&gt;
        struct list_head  loh_layers;&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Describe datatype invariants (preferably formally).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * Fields are protected by the lock on cfs_page_t, except for atomics and&lt;br /&gt;
 * immutables.&lt;br /&gt;
 *&lt;br /&gt;
 * \invariant Datatype invariants are in cl_page_invariant(). Basically:&lt;br /&gt;
 * cl_page::cp_parent and cl_page::cp_child are a well-formed double-linked&lt;br /&gt;
 * list, consistent with the parent/child pointers in the cl_page::cp_obj and&lt;br /&gt;
 * cl_page::cp_owner (when set).&lt;br /&gt;
 */&lt;br /&gt;
struct cl_page {&lt;br /&gt;
        /** Reference counter. */&lt;br /&gt;
        atomic_t           cp_ref;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Describe concurrency control mechanisms for structure fields.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
        /** An object this page is a part of. Immutable after creation. */&lt;br /&gt;
        struct cl_object  *cp_obj;&lt;br /&gt;
        /** Logical page index within the object. Immutable after creation. */&lt;br /&gt;
        pgoff_t            cp_index;&lt;br /&gt;
        /** List of slices. Immutable after creation. */&lt;br /&gt;
        struct list_head   cp_layers;&lt;br /&gt;
        ...&lt;br /&gt;
};&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Specify when fields are valid.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
        /**&lt;br /&gt;
         * Owning IO in cl_page_state::CPS_OWNED state. Sub-page can be owned&lt;br /&gt;
         * by sub-io.&lt;br /&gt;
         */&lt;br /&gt;
        struct cl_io      *cp_owner;&lt;br /&gt;
        /**&lt;br /&gt;
         * Owning IO request in cl_page_state::CPS_PAGEOUT and&lt;br /&gt;
         * cl_page_state::CPS_PAGEIN states. This field is maintained only in&lt;br /&gt;
         * the top-level pages.&lt;br /&gt;
         */&lt;br /&gt;
        struct cl_req     *cp_req;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You can use [http://www.doxygen.org/grouping.html#memgroup &#039;&#039;@{&#039;&#039;&#039;...&#039;&#039;&#039;@}&#039;&#039;] syntax to define a subset of fields or &#039;&#039;enum&#039;&#039; values that should be grouped together.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
struct cl_object_header {&lt;br /&gt;
        /** Standard lu_object_header. cl_object::co_lu::lo_header points&lt;br /&gt;
         * here. */&lt;br /&gt;
        struct lu_object_header  coh_lu;&lt;br /&gt;
        /** \name locks&lt;br /&gt;
         * \todo XXX move locks below to the separate cache-lines, they are&lt;br /&gt;
         * mostly useless otherwise.&lt;br /&gt;
         */&lt;br /&gt;
        /** @{ */&lt;br /&gt;
        /** Lock protecting page tree. */&lt;br /&gt;
        spinlock_t               coh_page_guard;&lt;br /&gt;
        /** Lock protecting lock list. */&lt;br /&gt;
        spinlock_t               coh_lock_guard;&lt;br /&gt;
        /** @} locks */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
By default, a documenting comment goes immediately before the entity being commented.  If it is necessary to place this comment separately (e.g., to streamline comments in the header file), use the following syntax.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/** \struct cl_page&lt;br /&gt;
 * Layered client page.&lt;br /&gt;
 *&lt;br /&gt;
 * cl_page: represents a portion of a file, cached in the memory. All pages&lt;br /&gt;
 *    of the given file are of the same size, and are kept in the radix tree&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subsystem Overview===&lt;br /&gt;
&lt;br /&gt;
To document a subsystem, add the following comment to the header file that contains the definitions of its key datatypes.  This will group all the documentation in the &#039;&#039;@{&#039;&#039;...&#039;&#039;@}&#039;&#039; block.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/** \defgroup component_name Component Name&lt;br /&gt;
 *&lt;br /&gt;
 * overall module documentation&lt;br /&gt;
 * ...&lt;br /&gt;
 *&lt;br /&gt;
 * @{ &lt;br /&gt;
 */&lt;br /&gt;
datatype definitions...&lt;br /&gt;
exported functions...&lt;br /&gt;
/** @} component_name */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The single-word name &#039;&#039;component_name&#039;&#039; identifies a group to doxygen.  &#039;&#039;Component Name&#039;&#039; is the printable title of the group.  It extends to the end of the line.  See [http://doxygen.org/commands.html#cmddefgroup \defgroup] for more details.&lt;br /&gt;
&lt;br /&gt;
To separate a logical part of a larger component, add the following somewhere within the &#039;&#039;\defgroup&#039;&#039; of the component:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/**&lt;br /&gt;
 * \name Printable Title of sub-component&lt;br /&gt;
 *&lt;br /&gt;
 * Description of a sub-component&lt;br /&gt;
 */&lt;br /&gt;
/** @{ */&lt;br /&gt;
datatype definitions...&lt;br /&gt;
exported functions...&lt;br /&gt;
/** @} */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If an exported function prototype in a header is located within some group, the appropriate function definition in a &#039;&#039;.c&#039;&#039; file is automatically assigned to the same group.&lt;br /&gt;
&lt;br /&gt;
A set of comments that is not lexically a part of a group can be included into it with the &#039;&#039;\addtogroup&#039;&#039; command.  It works just like &#039;&#039;\defgroup&#039;&#039;, but the printable group title is optional.  See [http://doxygen.org/commands.html#cmdaddtogroup \addtogroup] for full details.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/** \addtogroup cl_object&lt;br /&gt;
 * @{ */&lt;br /&gt;
/**&lt;br /&gt;
 * &amp;quot;Data attributes&amp;quot; of cl_object. Data attributes can be updated&lt;br /&gt;
 * independently for a sub-object, and top-object&#039;s attributes are calculated&lt;br /&gt;
 * from sub-objects&#039; ones.&lt;br /&gt;
 */&lt;br /&gt;
struct cl_attr {&lt;br /&gt;
        /** Object size, in bytes */&lt;br /&gt;
        loff_t cat_size;&lt;br /&gt;
        ...&lt;br /&gt;
};&lt;br /&gt;
...&lt;br /&gt;
/** @} cl_object */&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Running Doxygen ==&lt;br /&gt;
You need to install the Graphviz package before you can run doxygen.&lt;br /&gt;
&lt;br /&gt;
Doxygen uses a &#039;&#039;configuration file&#039;&#039; to control how it builds documentation. See [http://www.doxygen.org/config.html Doxygen Configuration] for details.&lt;br /&gt;
&lt;br /&gt;
Lustre comes with two configuration files:&lt;br /&gt;
* &#039;&#039;build/doxyfile.ref&#039;&#039; produces a &#039;&#039;short&#039;&#039; form of the documentation set, suitable as a reference. Output is placed into the &#039;&#039;doxygen.ref/&#039;&#039; directory.&lt;br /&gt;
* &#039;&#039;build/doxyfile.api&#039;&#039; produces a full documentation set, more suitable for learning code structure. In addition to the short form, this set includes call-graphs and source code excerpts. Output is placed into the &#039;&#039;doxygen.api/&#039;&#039; directory.&lt;br /&gt;
&lt;br /&gt;
If the version of doxygen you are running is newer than the one last used to generate the configuration files, run the following commands to upgrade:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
doxygen -s -u build/doxyfile.api&lt;br /&gt;
doxygen -s -u build/doxyfile.ref&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To build all the documentation, in the top-level lustre directory, run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
doxygen build/doxyfile.api&lt;br /&gt;
doxygen build/doxyfile.ref&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There are also phony Makefile targets &#039;&#039;doxygen-api&#039;&#039; and &#039;&#039;doxygen-ref&#039;&#039; to run these commands individually, and &#039;&#039;doxygen&#039;&#039; to run both.&lt;br /&gt;
&lt;br /&gt;
Note that doxygen currently gives many warnings about undocumented entities.  These should abate as we improve the code documentation.&lt;br /&gt;
&lt;br /&gt;
== Publishing Documentation ==&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;build/publish_doxygen&#039;&#039; script publishes a locally built version of the documentation to &amp;quot;http://wiki.lustre.org/doxygen&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
build/publish_doxygen [-b branchname] [-l additional-label] [-d] [-u user] [-p port]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The script guesses the branch name from the current git branch.  The user and port are used to &#039;&#039;ssh&#039;&#039; into &#039;&#039;shell.lustre.sun.com&#039;&#039;.  &#039;&#039;User&#039;&#039; defaults to your &#039;&#039;$USER&#039;&#039; environment variable and &#039;&#039;port&#039;&#039; defaults to 922. The &#039;&#039;-d&#039;&#039; option instructs the script to use the current date as a label. &lt;br /&gt;
&lt;br /&gt;
Documentation is uploaded into...&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
user@shell.lustre.sun.com:/home/www/doxygen/$branch$label&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
where &#039;&#039;$label&#039;&#039; is a concatenation of all labels given on the command line in order.  The parent directory is &#039;&#039;rsync&#039;&#039;-ed to wiki.lustre.org regularly and the documentation can be browsed at...&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
http://wiki.lustre.org/doxygen&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When adding a new branch/label, you have to edit &#039;&#039;index.html&#039;&#039; in the doxygen directory on shell.lustre.sun.com.&lt;br /&gt;
&lt;br /&gt;
== Doxygen References ==&lt;br /&gt;
&lt;br /&gt;
[http://www.doxygen.org/ Doxygen Home]&lt;br /&gt;
&lt;br /&gt;
[http://www.doxygen.org/manual.html/ Doxygen Manual]&lt;br /&gt;
&lt;br /&gt;
[http://www.doxygen.org/commands.html/ Doxygen Special Commands]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Migrating_to_Git&amp;diff=9037</id>
		<title>Migrating to Git</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Migrating_to_Git&amp;diff=9037"/>
		<updated>2009-12-14T20:13:23Z</updated>

		<summary type="html">&lt;p&gt;Nathan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To migrate ongoing work from CVS to git SCM, first convert your work to a patch, then apply that patch to a git tree.&lt;br /&gt;
== Convert your work to patches ==&lt;br /&gt;
=== Work that does &#039;&#039;NOT&#039;&#039; live in a private CVS branch ===&lt;br /&gt;
If you maintain your development code with quilt or some method other than a private CVS branch, generate patches for any current work.&lt;br /&gt;
 cvs diff &amp;gt; my.patch&lt;br /&gt;
&lt;br /&gt;
=== Work from a private CVS branch ===&lt;br /&gt;
1. In your branch CVS working tree, use &#039;cvs diff&#039; against the base tree divergence point.  Since you&#039;ve been using the build/merge scripts, this is quite easy.  For example, for the params tree branch:&lt;br /&gt;
 cvs diff -r HD_PARAMS_TREE_BASE &amp;gt; hd_params_tree.patch&lt;br /&gt;
The merge scripts have kept the &amp;lt;branchname&amp;gt;_BASE tag updated to reflect the latest merge (don&#039;t use an old dated tag, use the one ending in _BASE).&lt;br /&gt;
&lt;br /&gt;
This cvs diff will include uncommitted changes in your working tree as well as all your committed code, so make sure your working directory tree is in the state you want. &lt;br /&gt;
&lt;br /&gt;
2. Inspect your patch to make sure it is correct.  It will be a patch against the divergence point, so realize that the base branch may have moved on and your patch may have to be updated when you apply it.&lt;br /&gt;
&lt;br /&gt;
== Apply your patches to a git repo ==&lt;br /&gt;
Obtain a clone of the [https://wikis.lustre.org/intra/index.php/Lustre_GIT#Getting_the_Lustre_repository lustre repository].&lt;br /&gt;
 git clone --origin prime git@git.lustre.org:prime/lustre &amp;lt;mydir&amp;gt;&lt;br /&gt;
 cd &amp;lt;mydir&amp;gt;&lt;br /&gt;
Create your own private branch.  For example, a branch for bug 20000 based off of HEAD:&lt;br /&gt;
 git checkout -b bug20000 master&lt;br /&gt;
Apply the patch to that branch:&lt;br /&gt;
 patch -p1 &amp;lt; hd_params_tree.patch&lt;br /&gt;
Resolve any merge conflicts, and commit the patch to your branch:&lt;br /&gt;
 git commit -a -v&lt;br /&gt;
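The patch-application step can be exercised safely in a throwaway directory first. This is a toy sketch with made-up file names, not the real Lustre tree or patch:&lt;br /&gt;

```shell
# Toy demonstration of applying a -p1 patch, as in the step above.
# Directory, file, and patch names here are hypothetical, for illustration only.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p repo/dir
printf 'old\n' > repo/dir/file.c
# Build a minimal unified diff with the usual a/ and b/ prefixes that -p1 strips.
printf '%s\n' '--- a/dir/file.c' '+++ b/dir/file.c' '@@ -1 +1 @@' '-old' '+new' > my.patch
cd repo
patch -p1 -i ../my.patch
cat dir/file.c                   # -> new
```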
&lt;br /&gt;
== Continue development ==&lt;br /&gt;
Sun employees should continue development under git as per the [https://wikis.lustre.org/intra/index.php/Lustre_GIT Lustre GIT page].&lt;br /&gt;
&lt;br /&gt;
External contributors should follow the procedure for [http://wiki.lustre.org/index.php/Submitting_Patches submitting patches].&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Migrating_to_Git&amp;diff=9036</id>
		<title>Migrating to Git</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Migrating_to_Git&amp;diff=9036"/>
		<updated>2009-12-14T20:08:31Z</updated>

		<summary type="html">&lt;p&gt;Nathan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To migrate ongoing work from CVS to git SCM, first convert your work to a patch, then apply that patch to a git tree.&lt;br /&gt;
== Convert your work to patches ==&lt;br /&gt;
=== Work that does &#039;&#039;NOT&#039;&#039; live in a private CVS branch ===&lt;br /&gt;
If you maintain your development code with quilt or some method other than a private CVS branch, generate patches for any current work.&lt;br /&gt;
 cvs diff &amp;gt; my.patch&lt;br /&gt;
&lt;br /&gt;
=== Work from a private CVS branch ===&lt;br /&gt;
1. In your branch CVS working tree, use &#039;cvs diff&#039; against the base tree divergence point.  Since you&#039;ve been using the build/merge scripts, this is quite easy.  For example, for the params tree branch:&lt;br /&gt;
 cvs diff -r HD_PARAMS_TREE_BASE &amp;gt; hd_params_tree.patch&lt;br /&gt;
The merge scripts have kept the &amp;lt;branchname&amp;gt;_BASE tag updated to reflect the latest merge (don&#039;t use an old dated tag, use the one ending in _BASE).&lt;br /&gt;
&lt;br /&gt;
This cvs diff will include uncommitted changes in your working tree as well as all your committed code, so make sure your working directory tree is in the state you want. &lt;br /&gt;
&lt;br /&gt;
2. Inspect your patch to make sure it is correct.  It will be a patch against the divergence point, so realize that the base branch may have moved on and your patch may have to be updated when you apply it.&lt;br /&gt;
&lt;br /&gt;
== Apply your patches to a git repo ==&lt;br /&gt;
Obtain a clone of the [https://wikis.lustre.org/intra/index.php/Lustre_GIT#Getting_the_Lustre_repository lustre repository].&lt;br /&gt;
 git clone --origin prime git@git.lustre.org:prime/lustre &amp;lt;mydir&amp;gt;&lt;br /&gt;
 cd &amp;lt;mydir&amp;gt;&lt;br /&gt;
Create your own private branch.  For example, a branch for bug 20000 based off of HEAD:&lt;br /&gt;
 git checkout -b bug20000 master&lt;br /&gt;
Apply the patch to that branch:&lt;br /&gt;
 patch -p1 &amp;lt; hd_params_tree.patch&lt;br /&gt;
Resolve any merge conflicts, and commit the patch to your branch:&lt;br /&gt;
 git commit -a -v&lt;br /&gt;
&lt;br /&gt;
Sun employees should continue development under git as per the [https://wikis.lustre.org/intra/index.php/Lustre_GIT Lustre GIT page]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Migrating_to_Git&amp;diff=9035</id>
		<title>Migrating to Git</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Migrating_to_Git&amp;diff=9035"/>
		<updated>2009-12-14T19:58:08Z</updated>

		<summary type="html">&lt;p&gt;Nathan: add content&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To migrate ongoing work from CVS to git SCM, first convert your work to a patch, then apply that patch to a git tree.&lt;br /&gt;
== Convert your work to patches ==&lt;br /&gt;
=== Work that does &#039;&#039;NOT&#039;&#039; live in a private CVS branch ===&lt;br /&gt;
If you maintain your development code with quilt or some method other than a private CVS branch, generate patches for any current work.&lt;br /&gt;
 cvs diff &amp;gt; my.patch&lt;br /&gt;
&lt;br /&gt;
=== Work from a private CVS branch ===&lt;br /&gt;
1. In your branch CVS working tree, use &#039;cvs diff&#039; against the base tree divergence point.  Since you&#039;ve been using the build/merge scripts, this is quite easy.  For example, for the params tree branch:&lt;br /&gt;
 cvs diff -r HD_PARAMS_TREE_BASE &amp;gt; hd_params_tree.patch&lt;br /&gt;
The merge scripts have kept the &amp;lt;branchname&amp;gt;_BASE tag updated to reflect the latest merge (don&#039;t use an old dated tag, use the one ending in _BASE).&lt;br /&gt;
&lt;br /&gt;
This cvs diff will include uncommitted changes in your working tree as well as all your committed code, so make sure your working directory tree is in the state you want. &lt;br /&gt;
&lt;br /&gt;
2. Inspect your patch to make sure it is correct.  It will be a patch against the divergence point, so realize that the base branch may have moved on and your patch may have to be updated when you apply it.&lt;br /&gt;
&lt;br /&gt;
== Apply your patches to a git repo ==&lt;br /&gt;
Obtain a clone of the [https://wikis.lustre.org/intra/index.php/Lustre_GIT#Getting_the_Lustre_repository lustre repository].&lt;br /&gt;
 git clone --origin prime git@git.lustre.org:prime/lustre &amp;lt;mydir&amp;gt;&lt;br /&gt;
 cd &amp;lt;mydir&amp;gt;&lt;br /&gt;
Create your own private branch.  For example, a branch for bug 20000 based off of HEAD:&lt;br /&gt;
 git checkout -b bug20000 master&lt;br /&gt;
Apply the patch to that branch:&lt;br /&gt;
 patch -p1 &amp;lt; hd_params_tree.patch&lt;br /&gt;
Resolve any merge conflicts, and commit the patch to your branch:&lt;br /&gt;
 git commit -a -v&lt;br /&gt;
&lt;br /&gt;
Now continue development under git as per the [https://wikis.lustre.org/intra/index.php/Lustre_GIT Lustre GIT page]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Architecture_-_HSM_Migration&amp;diff=9727</id>
		<title>Architecture - HSM Migration</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Architecture_-_HSM_Migration&amp;diff=9727"/>
		<updated>2009-04-21T20:31:35Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* Version 1 (&amp;quot;simple&amp;quot;): &amp;quot;Migration on open&amp;quot; policy */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Purpose ==&lt;br /&gt;
&lt;br /&gt;
This page describes use cases and high-level architecture for migrating files between Lustre and a HSM system.&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
; Trigger : A process or event in the file system which causes a migration to take place (or be denied).&lt;br /&gt;
; Coordinator : A service coordinating migration of data.&lt;br /&gt;
; Agent : A service used by coordinators to move data or cancel such movement.&lt;br /&gt;
; Mover : The userspace component of the &#039;&#039;&#039;agent&#039;&#039;&#039; which copies the file between Lustre and the HSM storage.&lt;br /&gt;
; Copy tool : HSM-specific component of the &#039;&#039;&#039;mover&#039;&#039;&#039;. (May be the entire mover.)&lt;br /&gt;
; I/O request : This term covers read requests, write requests, and metadata operations such as truncate or unlink.&lt;br /&gt;
; Resident : A file whose working copy is in Lustre.&lt;br /&gt;
; Release : A released file&#039;s data has been removed by Lustre after being copied to the HSM.  The MDT retains the file&#039;s metadata.&lt;br /&gt;
; Archive : An archived file&#039;s data resides in the HSM.  File data may or may not also reside in Lustre.  The MDT retains the file&#039;s metadata. &lt;br /&gt;
; Restore : Copy a file from the HSM back into Lustre to make an Archived file Resident.&lt;br /&gt;
; Prestage : An explicit call (from User or Policy Engine) to Restore a Released file.&lt;br /&gt;
 &lt;br /&gt;
== Use cases ==&lt;br /&gt;
&lt;br /&gt;
=== Summary ===&lt;br /&gt;
&lt;br /&gt;
{| border=1 cellspacing=0&lt;br /&gt;
|-&lt;br /&gt;
!id !! quality attribute !! summary&lt;br /&gt;
|-&lt;br /&gt;
|Restore || availability || If a file is accessed from the primary storage system (Lustre) but resides in the backend storage system (HSM), it can be copied back into the primary storage system and made available.&lt;br /&gt;
|-&lt;br /&gt;
|Archive || availability, usability || The system can copy files from the primary to the backend storage system.&lt;br /&gt;
|-&lt;br /&gt;
|access-during-archive || performance, usability || When a migrating file is accessed, the migration may be aborted.&lt;br /&gt;
|-&lt;br /&gt;
|component-failure || availability || If a migration component fails, the migration is resumed or aborted depending on the migration state and the failing component.&lt;br /&gt;
|-&lt;br /&gt;
|unlink || availability || When a Lustre object is deleted, the MDTs or OSTs request the corresponding removal from the HSM, depending on policy.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Restore (aka cache-miss aka copyin) ===&lt;br /&gt;
&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || File data is copied from an HSM into Lustre transparently &lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Automatically provides filesystem access to all files in the HSM&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Availability&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| A released file is accessed&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| A client tries to open a released file, or an explicit call to prestage&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| Released and restored files in Lustre &lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| Released file&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| Block the client open request. Start a file transfer from the HSM to Lustre using the dedicated copy tool. As soon as the data are fully available, reply to the client. Tag the file as resident.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| The file is 100% resident in Lustre, and the open completes without error or timeout.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Archive (aka copyout) ===&lt;br /&gt;
&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || Some files are copied to the HSM from Lustre.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Provides a large capacity archiving system for Lustre transparently.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Availability, usability.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| Explicit request&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| Administrator/User or policy engine.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| Candidate files from the policy.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| Files matching the policy.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| The coordinator, receiving the request, starts a transfer between Lustre and the HSM for selected files. A dedicated agent is called and will spawn the copy tool to manage the transfer. When the transfer is completed, the Lustre file is tagged as Archived and Not Dirty on the MDT.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| Policy compliance.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== access-during-archive ===&lt;br /&gt;
&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || The file being archived is accessed during the archival process.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Optimize file accesses.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Usability.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| File access on migrating file.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| A process on any client opens a file or requests some specific actions.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| File currently undergoing archival.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| File metadata.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| A file that is modified during archival must at no point be marked as up-to-date in the HSM until it has been copied completely and coherently.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| Policy engine must re-queue file for archival.  MDT disallows release of the file.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== component-failure ===&lt;br /&gt;
&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || A transfer component fails during a migration.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Interrupted jobs should be restarted.  Filesystem and HSM must remain coherent.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Availability.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| A peer times out while exchanging data with a specific component.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| Any component failure (client, MDT, agent).&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| Lustre components used for file migration.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| A component that can no longer be reached.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| On client failure, the migration is finished anyway. On MDT/coordinator failure, a recovery mechanism using persistent state must be applied. On agent failure, the archive process is aborted and the coordinator will respawn it when necessary. On OST failure, the archive process is delayed and managed like a traditional I/O.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| The system is coherent, no data transfer process is hung.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== unlink ===&lt;br /&gt;
&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || A client unlinks a file in Lustre.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Do not limit Lustre unlink speed to HSM speed.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Availability.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| A client issues an unlink request on a file in Lustre.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| A Lustre client.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| A Lustre filesystem with objects archived in the HSM.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| A file archived in the HSM.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| The file is unlinked normally in Lustre. For each Lustre object removed this way, an unlink request is sent to the coordinator for the corresponding removal.  This should be asynchronous and may be delayed for a long period.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| No file object exists anymore in Lustre or the archive.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| None.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Components ==&lt;br /&gt;
&lt;br /&gt;
===Coordinator===&lt;br /&gt;
# dispatches requests to agents; chooses agents&lt;br /&gt;
## restore &amp;lt;FIDlist&amp;gt;&lt;br /&gt;
## archive &amp;lt;FIDlist&amp;gt;&lt;br /&gt;
## unlink &amp;lt;FIDlist&amp;gt;&lt;br /&gt;
## abort_action &amp;lt;cookie&amp;gt;&lt;br /&gt;
# consolidates repeat requests&lt;br /&gt;
# re-queues requests to a new agent if an agent becomes unresponsive (aborts old request)&lt;br /&gt;
## agents send regular progress updates to coordinator (e.g. current extent)&lt;br /&gt;
## coordinator periodically checks for stuck threads&lt;br /&gt;
# coordinator requests are persistent&lt;br /&gt;
## all requests coming to the coordinator are kept in llog, cancelled when complete or aborted &lt;br /&gt;
# kernel-space service, MDT acts as initiator for copyin&lt;br /&gt;
# ioctl interface for all requests.  Initiators are policy engine, administrator tool, or MDT for cache-miss.&lt;br /&gt;
# Location: a coordinator will be directly integrated with each MDT&lt;br /&gt;
## Agents will communicate via MDC&lt;br /&gt;
## Connection/reconnection already taken care of; no additional pinging, config&lt;br /&gt;
## Client mount option will indicate &amp;quot;agent&amp;quot;, connect flag will inform MDT&lt;br /&gt;
## MDT already has intimate knowledge of HSM bits (see below) and needs to communicate with coordinator anyhow&lt;br /&gt;
## HSM comms can use a new portal and reuse MDT threads.&lt;br /&gt;
## Coordinators will handle the same namespace segment as each MDT under CMD&lt;br /&gt;
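The dispatch, consolidation and re-queue behaviour listed above can be sketched as a toy model; the round-robin agent choice, the record layout and the timeout value are all illustrative, not the real implementation:&lt;br /&gt;

```python
import time

class Coordinator:
    """Toy coordinator: consolidates duplicate requests and re-queues
    work from agents whose progress updates have gone stale.
    All names and the timeout are illustrative, not Lustre's."""

    STALE_SECS = 30.0  # illustrative stuck-agent timeout

    def __init__(self, agents):
        self.agents = list(agents)   # available agent ids
        self.active = {}             # (fid, op) maps to a request record
        self.next_agent = 0

    def request(self, op, fid, now=None):
        now = time.time() if now is None else now
        key = (fid, op)
        if key in self.active:       # consolidate repeat requests
            return self.active[key]["agent"]
        agent = self.agents[self.next_agent % len(self.agents)]
        self.next_agent += 1
        self.active[key] = {"agent": agent, "last_update": now}
        return agent

    def progress(self, op, fid, now=None):
        # agents send regular progress updates; record the timestamp
        now = time.time() if now is None else now
        self.active[(fid, op)]["last_update"] = now

    def reap_stuck(self, now=None):
        """Abort and re-dispatch any request whose agent went quiet."""
        now = time.time() if now is None else now
        moved = []
        for key, rec in list(self.active.items()):
            if now - rec["last_update"] > self.STALE_SECS:
                del self.active[key]   # abort the old request
                fid, op = key
                moved.append((fid, op, self.request(op, fid, now=now)))
        return moved
```

Here reap_stuck() plays the role of the periodic stuck-thread check; a real coordinator would also signal the abort to the agent and keep the records in a persistent llog.&lt;br /&gt;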
&lt;br /&gt;
===MDT changes===&lt;br /&gt;
# Per-file layout lock&lt;br /&gt;
## A new layout lock is created for every file.  The lock contains a layout version number.&lt;br /&gt;
## Private writer lock is taken by the MDT when allocating/changing file layout (LOV EA). &lt;br /&gt;
### The lock is not released until the layout change is complete and the data exist in the new layout.  &lt;br /&gt;
### The MDT will take group extent locks for the entire file.  The group ID will be passed to the agent performing the data transfer.&lt;br /&gt;
### The current layout version is stored by the OSTs for each object in the layout.&lt;br /&gt;
## Shared reader locks are taken by anyone reading the layout (client opens, lfs getstripe) to get the layout version.&lt;br /&gt;
## Anyone taking a new extent lock anywhere in the file includes the layout version. The OST will grant an extent lock only if the layout version included in the RPC matches the object layout version.&lt;br /&gt;
# lov EA changes&lt;br /&gt;
## flags&lt;br /&gt;
### hsm_released: file is not resident on OSTs; only in HSM&lt;br /&gt;
### hsm_exists: some version of this fid exists in HSM; maybe partial or outdated&lt;br /&gt;
### hsm_dirty: file in HSM is out of date&lt;br /&gt;
### hsm_archived: a full copy of this file exists in HSM; if not hsm_dirty, then the HSM copy is current.  &lt;br /&gt;
## The hsm_released flag is always manipulated under a write layout lock, the other flags are not.&lt;br /&gt;
# new ioctls for HSM control: &lt;br /&gt;
## HSM_REQUEST: policy engine or admin requests (archive, release, restore, remove, cancel) &amp;lt;FIDlist&amp;gt;&lt;br /&gt;
## HSM_STATE_GET: user requests HSM status information on a single file&lt;br /&gt;
## HSM_STATE_SET: user sets HSM policy flags for a single file (HSM_NORELEASE, HSM_NOARCHIVE)&lt;br /&gt;
## HSM_PROGRESS: copytool reports periodic state of a single request (current extent, error)&lt;br /&gt;
## HSM_TAPEFILE_ADD: add an existing archived file into the Lustre filesystem (only metadata is copied).&lt;br /&gt;
# changelogs:&lt;br /&gt;
## new events for HSM event completion&lt;br /&gt;
### restore_complete&lt;br /&gt;
### archive_complete&lt;br /&gt;
### unlink_complete&lt;br /&gt;
## per-event flags used by HSM&lt;br /&gt;
### setattr: data_changed (actually mtime_changed for V1)&lt;br /&gt;
### archive_complete: hsm_dirty&lt;br /&gt;
### all HSM events: hsm_failed&lt;br /&gt;
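The lov EA flag semantics above can be condensed into a few lines; the flag names come from the list, while the set-based encoding and helper names are illustrative:&lt;br /&gt;

```python
# Sketch of the lov EA HSM flag rules described above.
RELEASED = "hsm_released"
EXISTS = "hsm_exists"
DIRTY = "hsm_dirty"
ARCHIVED = "hsm_archived"

def archive_complete(flags):
    """Archive finished: a full HSM copy now exists and is current."""
    flags |= {EXISTS, ARCHIVED}
    flags.discard(DIRTY)
    return flags

def file_written(flags):
    """Any write makes the HSM copy out of date."""
    flags.add(DIRTY)
    return flags

def may_release(flags):
    """Release is allowed only if the HSM copy is full and current."""
    return ARCHIVED in flags and DIRTY not in flags
```

A file is thus releasable exactly when hsm_archived is set and hsm_dirty is clear, matching the descriptions above.&lt;br /&gt;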
&lt;br /&gt;
&lt;br /&gt;
===Agent===&lt;br /&gt;
&lt;br /&gt;
An &#039;&#039;&#039;agent&#039;&#039;&#039; manages local HSM requests on a client.&lt;br /&gt;
# one agent per client max; most clients will not have agents&lt;br /&gt;
# consists of two parts&lt;br /&gt;
## kernel component receives messages from the coordinator (LNET comms)&lt;br /&gt;
### agents and coordinator piggyback comms on MDC/MDT: connections, recovery, etc.&lt;br /&gt;
### coordinator uses reverse imports to send RPCs to agents&lt;br /&gt;
## userspace process copies data between Lustre and HSM backend&lt;br /&gt;
### will use special fid directory for file access (.lustre/fid/XXXX)&lt;br /&gt;
### interfaces with hardware-specific copytool to access HSM files&lt;br /&gt;
# kernel process passes requests to userspace process via socket&lt;br /&gt;
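The kernel-to-userspace hand-off in the last step might look like this in miniature; a socketpair and JSON framing stand in for the real agent protocol, which is not specified here:&lt;br /&gt;

```python
import json
import socket

# "Kernel" side writes a request onto a socket; the userspace side
# reads it and would then spawn the copytool. Purely an analogy for
# the agent's internal request channel.
kernel_side, user_side = socket.socketpair()

def kernel_dispatch(sock, op, fid):
    # serialize a coordinator request for the userspace mover
    sock.sendall((json.dumps({"op": op, "fid": fid}) + "\n").encode())

def userspace_receive(sock):
    # read one newline-delimited request and decode it
    buf = b""
    while not buf.endswith(b"\n"):
        buf += sock.recv(4096)
    return json.loads(buf.decode())
```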
&lt;br /&gt;
===Copytool===&lt;br /&gt;
&lt;br /&gt;
The copytool copies data between Lustre and the HSM backend, and deletes the HSM object when necessary.&lt;br /&gt;
# userspace; runs on a Lustre client with HSM i/o access&lt;br /&gt;
# opens objects by fid &lt;br /&gt;
# may manipulate HSM mode flags in an EA.  &lt;br /&gt;
# uses ioctl calls on the (opened-by-fid) file to report progress to MDT.  Note MDT must pass some messages on to Coordinator.&lt;br /&gt;
## updates progress regularly while waiting for HSM (e.g. every X seconds)&lt;br /&gt;
## reports error conditions&lt;br /&gt;
## reports current extent&lt;br /&gt;
# the copytool is HSM-specific, since it must move data to the HSM archive&lt;br /&gt;
## version 1 will include tools for HPSS and SAM-QFS&lt;br /&gt;
## other, vendor-proprietary (binary) tools may be wrapped in order to include Lustre ioctl progress calls. &lt;br /&gt;
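A copytool&#039;s reporting obligation could be structured as the loop below; the report callback stands in for the HSM_PROGRESS ioctl, and the chunk size and interval are arbitrary:&lt;br /&gt;

```python
import io
import time

def copy_with_progress(src, dst, length, report, chunk=2**20, interval=10.0):
    """Copy `length` bytes in chunks, reporting the current extent
    at most every `interval` seconds, plus errors and completion.
    `report` is a stand-in for the HSM_PROGRESS ioctl."""
    done = 0
    last = 0.0
    while done != length:
        data = src.read(min(chunk, length - done))
        if not data:
            report(done, error="short read from HSM")  # report error conditions
            return done
        dst.write(data)
        done += len(data)
        now = time.monotonic()
        if now - last >= interval or done == length:
            report(done)   # current extent: bytes 0..done
            last = now
    return done
```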
&lt;br /&gt;
===Policy Engine===&lt;br /&gt;
&lt;br /&gt;
# makes policy decisions for archive, release (which files and when)&lt;br /&gt;
## policy engine will provide the functionality of the [[Space Manager|Space_Manager]] and any other archive/release policies&lt;br /&gt;
## may be based on space available per filesystem, OST, or pool&lt;br /&gt;
## may be based on any filesystem or per-file attributes (last access time, file size, file type, etc)&lt;br /&gt;
## policy engine will therefore require access to various user-available info: changelogs, getstripe, lfs df, stat, lctl get_param, etc.&lt;br /&gt;
# normally uses changelogs and &#039;df&#039; for input; rarely is allowed to scan filesystem&lt;br /&gt;
## changelogs are available to superuser on Lustre clients&lt;br /&gt;
## filesystem scans are expensive; allowed only at initial HSM setup time or other rare events&lt;br /&gt;
# the policy engine runs as a userspace process; requests archive and release via file ioctl to coordinator (through MDT).&lt;br /&gt;
# policy engine may be packaged separately from Lustre&lt;br /&gt;
# the policy engine may use HSM-backend specific features (e.g. HPSS storage class) for policy optimizations, but these will be kept modularized so they are easily removed for other systems.&lt;br /&gt;
# API can pass an opaque arbitrary chunk of data (char array, size) from policy engine ioctl call through coordinator and agent to copytool.&lt;br /&gt;
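As a sketch of the kind of decision the policy engine makes, the following selects archive candidates by age and release candidates coldest-first until a space target is met; the field names, thresholds and ordering heuristic are invented for the illustration:&lt;br /&gt;

```python
# Illustrative policy pass over per-file stats (as the engine would
# derive from changelogs plus stat); not a real policy-engine API.
def pick_candidates(files, now, max_age, bytes_needed):
    # archive anything not yet archived and idle longer than max_age
    archive = [f["fid"] for f in files
               if not f["archived"] and now - f["atime"] > max_age]
    release, freed = [], 0
    # release the coldest already-archived, non-dirty files first
    for f in sorted(files, key=lambda f: f["atime"]):
        if freed >= bytes_needed:
            break
        if f["archived"] and not f["dirty"]:
            release.append(f["fid"])
            freed += f["size"]
    return archive, release
```

The real engine would obtain these inputs from changelogs, lfs df and stat rather than a prepared list.&lt;br /&gt;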
&lt;br /&gt;
===Configuration===&lt;br /&gt;
# the policy engine has its own external configuration&lt;br /&gt;
# coordinator starts as part of the MDT; tracks agent registrations as clients connect&lt;br /&gt;
## connect flag to indicate agent should run on this MDC&lt;br /&gt;
## mdt_set_info RPC for setting agent status using &#039;remount&#039;&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
== Scenarios ==&lt;br /&gt;
&lt;br /&gt;
=== restore (aka cache-miss, copyin) ===&lt;br /&gt;
&lt;br /&gt;
====Version 1 (&amp;quot;simple&amp;quot;): &amp;quot;Migration on open&amp;quot; policy ====&lt;br /&gt;
Clients block at open for read and write.  OSTs are not involved.&lt;br /&gt;
# Client layout-intent enqueues layout read lock on the MDT.&lt;br /&gt;
# MDT checks hsm_released bit; if released, the MDT takes PW lock on the layout&lt;br /&gt;
# MDT creates a new layout with a similar stripe pattern as the original, increasing the layout version, and allocating new objects on new OSTs with the new version. &lt;br /&gt;
#: (We should try to respect specific layout settings (pool, stripecount, stripesize), but be flexible if e.g. pool doesn&#039;t exist anymore.&lt;br /&gt;
#: Maybe we want to ignore stripe offset and/or specific OST allocations in order to rebalance.)&lt;br /&gt;
# MDT enqueues group write lock on extents 0-EOF&lt;br /&gt;
#: Extent lock enqueue timeout must be very long while the group lock is held (needs a proc tunable here)&lt;br /&gt;
# MDT releases PW layout lock&lt;br /&gt;
#: Client open succeeds at this point, but r/w is blocked on extent locks&lt;br /&gt;
# MDT sends request to coordinator requesting restore of the file to .lustre/fid/XXXX with group lock id and extents 0-EOF. (Extents may be used in the future to (a) copy in part of a file, in low-disk-space situations; (b) copy in individual stripes simultaneously on multiple OSTs.)&lt;br /&gt;
# Coordinator distributes that request to an appropriate &#039;&#039;&#039;agent&#039;&#039;&#039;.&lt;br /&gt;
# &#039;&#039;&#039;Agent&#039;&#039;&#039; starts copytool&lt;br /&gt;
# Copytool opens .lustre/fid/&amp;lt;fid&amp;gt;&lt;br /&gt;
# Copytool takes group extents lock&lt;br /&gt;
# Copytool copies data from HSM, reporting progress via ioctl&lt;br /&gt;
# When finished, copytool reports progress of 0-EOF and closes the file, releasing group extents lock.&lt;br /&gt;
# MDT clears hsm_released bit&lt;br /&gt;
# MDT releases group extents lock&lt;br /&gt;
#: This sends a completion AST to the original client, which now receives its extent lock.&lt;br /&gt;
# MDT adds FID &amp;lt;fid&amp;gt; HSM_copyin_complete record to changelog (flags: failed)&lt;br /&gt;
&lt;br /&gt;
[[Image:hsm_copyin.png|HSM Copy-in schema]]&lt;br /&gt;
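Condensed, the restore-on-open steps amount to the following state transitions; the dict fields are placeholders for the MDT state and the agent/copytool RPCs, not real Lustre calls:&lt;br /&gt;

```python
# Minimal model of "migration on open": check hsm_released, bump the
# layout version under lock, copy the data back, clear the bit.
def restore_on_open(f):
    if not f["hsm_released"]:
        return "open"                 # resident file: nothing to do
    # MDT: new layout with bumped version, under PW layout lock
    f["layout_version"] += 1
    f["group_locked"] = True          # group extent lock, 0-EOF
    # coordinator dispatches to an agent; copytool copies data back
    f["data"] = f["hsm_copy"]
    f["hsm_released"] = False         # MDT clears hsm_released
    f["group_locked"] = False         # completion AST unblocks the client
    return "open"
```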
&lt;br /&gt;
====Version 2 (&amp;quot;complex&amp;quot;): &amp;quot;Migration on first I/O&amp;quot; policy ====&lt;br /&gt;
Clients are able to read/write the file data as soon as possible and the OSTs need to prevent access to the parts&lt;br /&gt;
of the file which have not yet been restored.&lt;br /&gt;
# getattr: attributes can be returned from MDT with no HSM involvement&lt;br /&gt;
## MDS holds file size[*]&lt;br /&gt;
## client may get MDS attribute read locks, but not layout lock&lt;br /&gt;
&lt;br /&gt;
# Client open intent enqueues layout read lock.&lt;br /&gt;
# MDT checks &amp;quot;purged&amp;quot; bit&lt;br /&gt;
# MDT creates a new layout with a similar stripe pattern as the original, allocating new objects on new OSTs with per-object &amp;quot;purged&amp;quot; bits set.&lt;br /&gt;
# MDT grants layout lock to client and open completes&lt;br /&gt;
# Question: should we pre-stage?  MDT sends request to coordinator requesting copyin of the file to .lustre/fid/XXXX with extents 0-EOF. &lt;br /&gt;
# client enqueues extent lock on OST. Must wait forever.&lt;br /&gt;
# check OST object is marked fully/partly invalid &lt;br /&gt;
## object may have persistent invalid map of extent(s) that indicate which parts of object require copy-in&lt;br /&gt;
# access to invalid parts of object trigger copy-in upcall to &#039;&#039;&#039;coordinator&#039;&#039;&#039; for those extents&lt;br /&gt;
## coordinator consolidates repeat requests for the same range (e.g. if entire file has already been queued for copyin, ignore specific range requests??)&lt;br /&gt;
# ? group locks on invalid part of file block writes to missing data&lt;br /&gt;
# clients block waiting on extent locks for invalid parts of objects&lt;br /&gt;
## OST crash at this time will restart enqueue process during replay&lt;br /&gt;
# &#039;&#039;&#039;coordinator&#039;&#039;&#039; contacts &#039;&#039;&#039;agent(s)&#039;&#039;&#039; to retrieve FID N extents X-Y from HSM&lt;br /&gt;
# copytool writes to actual object to be restored with &amp;quot;clear invalid&amp;quot; flag (special write)&lt;br /&gt;
## writes by agent shrink invalid extent, periodically update on-disk invalid extent and release locks on that part of file (on commit?)&lt;br /&gt;
## note changing lock extents (lock conversion) is not currently possible but is a long-term Lustre performance improvement goal.&lt;br /&gt;
# client is granted extent lock when that part of file is copied in&lt;br /&gt;
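The per-object invalid map described above might behave as follows; the interval bookkeeping is simplified to byte ranges and the class is purely illustrative:&lt;br /&gt;

```python
# Toy model of a purged OST object: restores shrink the invalid map,
# and an extent lock is grantable only over fully valid ranges.
class PurgedObject:
    def __init__(self, size):
        self.invalid = [(0, size)]    # extents still awaiting copy-in

    def restore(self, start, end):
        """Copytool wrote [start, end): shrink the invalid map."""
        out = []
        for a, b in self.invalid:
            if end > a and b > start:    # overlap: keep the remainders
                if start > a:
                    out.append((a, start))
                if b > end:
                    out.append((end, b))
            else:
                out.append((a, b))
        self.invalid = out

    def readable(self, start, end):
        # a client extent lock over [start, end) may be granted only
        # once the range intersects no invalid extent
        return all(not (end > a and b > start) for a, b in self.invalid)
```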
&lt;br /&gt;
=== copyout ===&lt;br /&gt;
# Policy engine (or administrator) decides to copy a file to HSM, executes HSMCopyOut ioctl on file&lt;br /&gt;
# ioctl caught by MDT, which passes request to Coordinator&lt;br /&gt;
# coordinator dispatches request to mover.  Request includes file extents (for future purposes)&lt;br /&gt;
# normal extents read lock is taken by mover running on client&lt;br /&gt;
# mover sends &amp;quot;copyout begin&amp;quot; message to coordinator via ioctl on the file&lt;br /&gt;
## coordinator/MDT sets &amp;quot;hsm_exists&amp;quot; bit and clears &amp;quot;hsm_dirty&amp;quot; bit.  &lt;br /&gt;
##: &amp;quot;hsm_exists&amp;quot; bit is never cleared, and indicates a copy (maybe partial/out of date) exists in the HSM &lt;br /&gt;
# any writes to the file cause the MDT to set the &amp;quot;hsm_dirty&amp;quot; bit (may be lazy/delayed with mtime or filesize change updates on MDT for V1).  &lt;br /&gt;
## file writes need not cancel copyout (settable via policy?  Implementation in V2.)&lt;br /&gt;
# mover sends status update to coordinator via periodic ioctl calls on the file (e.g % complete)&lt;br /&gt;
# mover sends &amp;quot;copyout done&amp;quot; message to coordinator via ioctl&lt;br /&gt;
# coordinator/MDT checks hsm_dirty bit.  &lt;br /&gt;
## If not dirty, MDT sets &amp;quot;copyout_complete&amp;quot; bit.  &lt;br /&gt;
## If dirty, coordinator dispatches another copyout request; goto step 3&lt;br /&gt;
# MDT adds FID X HSM_copyout_complete record to changelog&lt;br /&gt;
# Policy engine notes HSM_copyout_complete record from changelog (flags: failed, dirty)&lt;br /&gt;
&lt;br /&gt;
(Note: file modifications after copyout is complete will leave both the copyout_complete and hsm_dirty bits set.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:hsm_copyout.png|HSM Copy-out schema]]&lt;br /&gt;
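The dirty-bit handshake above reduces to a retry loop: a write that lands before copyout-done sets hsm_dirty, and the coordinator dispatches another copyout until a transfer completes with the bit still clear. The dict model below is illustrative only:&lt;br /&gt;

```python
# Toy model of the copyout steps: hsm_exists set and hsm_dirty cleared
# at "begin"; a mid-transfer write dirties the file and forces a redo.
def copyout(f, hsm, writes_during=()):
    writes = list(writes_during)       # data arriving mid-transfer
    while True:
        f["hsm_exists"] = True         # never cleared afterwards
        f["hsm_dirty"] = False         # "copyout begin"
        hsm["copy"] = f["data"]        # mover transfers the current data
        if writes:                     # a write lands before "done"
            f["data"] = writes.pop(0)
            f["hsm_dirty"] = True
        if not f["hsm_dirty"]:         # "copyout done", bit still clear
            f["copyout_complete"] = True
            return hsm["copy"]
```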
&lt;br /&gt;
=== purge (aka punch) ===&lt;br /&gt;
==== V1: full file purge ====&lt;br /&gt;
# Policy engine (or administrator) decides to purge a file, executes HSMPurge ioctl on file&lt;br /&gt;
# ioctl handled by MDT&lt;br /&gt;
# MDT takes a write lock on the file layout lock&lt;br /&gt;
# MDT enqueues write locks on all extents of the file.  After these are granted, no client has any dirty cache, and no client can take new extent locks until the layout lock is released.  MDT then drops all extent locks.&lt;br /&gt;
# MDT verifies that hsm_dirty bit is clear and copyout_complete bit is set&lt;br /&gt;
## if not, the file cannot be purged, return EPERM&lt;br /&gt;
# MDT marks the LOV EA as &amp;quot;purged&amp;quot;&lt;br /&gt;
# MDT sends destroys to the OST objects, using destroy llog entries to guard against object leakage during OST failover&lt;br /&gt;
## the OSTs should eventually purge the objects during orphan recovery&lt;br /&gt;
# MDT drops layout lock.&lt;br /&gt;
&lt;br /&gt;
==== V2: partial purge ====&lt;br /&gt;
Partial purging hopefully allows graphical file browsers to retrieve file header info or icons stored at the beginning or end of a file. &lt;br /&gt;
&#039;&#039;&#039;Note: determine exactly which parts of a file Windows Explorer reads to generate its icons&#039;&#039;&#039;&lt;br /&gt;
# MDT sends purge range to first and last objects, and destroys to all intermediate objects, using llog entries for recovery.&lt;br /&gt;
# First and last OSTs record purge range&lt;br /&gt;
# When requesting copyin of the entire file (first access to the middle of a partially purged file), MDT destroys old partial objects before allocating new layout. (Or: we keep old first and last objects, just allocate new &amp;quot;middle object&amp;quot; striping - yuck.)&lt;br /&gt;
&lt;br /&gt;
=== unlink ===&lt;br /&gt;
&lt;br /&gt;
# A client issues an unlink for a file to the MDT.&lt;br /&gt;
# The MDT includes the &amp;quot;hsm_exists&amp;quot; bit in the changelog unlink entry&lt;br /&gt;
# The policy engine determines if the file should be removed from HSM&lt;br /&gt;
# Policy engine sends HSMunlink FID to coordinator via MDT ioctl&lt;br /&gt;
## ioctl will be on the directory .lustre/fid&lt;br /&gt;
##: or perhaps on a new .lustre/dev/XXX where any lustre device may be listed, and act as stub files for handling ioctls.&lt;br /&gt;
# The coordinator sends a request to one of its agents for the corresponding removal.&lt;br /&gt;
# The agent spawns the HSM tool to do this removal.&lt;br /&gt;
# HSM tool reports completion via another MDT ioctl&lt;br /&gt;
# Coordinator cancels unlink request record&lt;br /&gt;
## In case of agent crash, unlink request will remain uncancelled and coordinator will eventually requeue&lt;br /&gt;
## In case of coordinator crash, agent ioctl will proceed after recovery&lt;br /&gt;
# Policy engine notes HSM_unlink_complete record from changelog (flags: failed)&lt;br /&gt;
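The crash handling in the last steps hinges on a persistent request record: the unlink request stays in the llog until the HSM tool reports completion, so whatever survives a crash is exactly what must be re-queued. A tiny log model (an in-memory list standing in for the on-disk llog):&lt;br /&gt;

```python
# Sketch of the persistent unlink-request log described above.
class UnlinkLog:
    def __init__(self):
        self.records = []             # persisted; survives restarts

    def queue(self, fid):
        self.records.append(fid)      # written before dispatch to an agent

    def complete(self, fid):
        self.records.remove(fid)      # cancelled on the HSM tool's ioctl

    def requeue_after_crash(self):
        # uncancelled records are exactly the removals still owed to the HSM
        return list(self.records)
```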
&lt;br /&gt;
&lt;br /&gt;
=== abort ===&lt;br /&gt;
# abort dead agent&lt;br /&gt;
#: the coordinator must send an abort signal to an agent to abort a copyout/copyin if it determines the migration is stuck/crashed.  The coordinator can then re-queue the migration request elsewhere.&lt;br /&gt;
# dirty-while-copyout&lt;br /&gt;
#: If a file is written to while it is being copied out, the HSM copy may be incoherent in some cases.&lt;br /&gt;
## We could send abort signal, but: &lt;br /&gt;
## If a filesystem has a single massive file that is used all the time, it will never get backed up if we abort.&lt;br /&gt;
## Not a problem if just appending to a file&lt;br /&gt;
## Most backup systems work this way with relatively little harm.&lt;br /&gt;
## V1: don&#039;t abort this case&lt;br /&gt;
## V2: abort in this case is a settable policy&lt;br /&gt;
&lt;br /&gt;
== Recovery ==&lt;br /&gt;
=== MDT crash ===&lt;br /&gt;
# MDT crashes and is restarted.&lt;br /&gt;
# The coordinator recreates its migration list by reading its llog.&lt;br /&gt;
# The client, when doing its recovery with the MDT, reconnects to the coordinator. &lt;br /&gt;
## Copytool eventually sends its periodic status update for migrating files (asynchronously from reconnect).&lt;br /&gt;
## As far as the copytool/agent is concerned, the MDT restart is invisible.&lt;br /&gt;
&lt;br /&gt;
Note: The migration list is simply the list of unfinished migrations which may be read from the llog &lt;br /&gt;
at any time (no need to keep it in memory all the time, if there are many open migration requests).&lt;br /&gt;
&lt;br /&gt;
Logs should contain:&lt;br /&gt;
# fid, request type, agent_id (for aborts)&lt;br /&gt;
# if the list is not kept in memory: last_status_update_time, last_status.&lt;br /&gt;
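Given records with the fields just listed, rebuilding the migration list after an MDT restart is a single pass over the llog; the staleness split mirrors the stuck-thread check, and the threshold value is illustrative:&lt;br /&gt;

```python
# Sketch of recovery from persisted llog records with the fields named
# above (fid, request type, agent_id, last_status_update_time).
def rebuild_migration_list(llog_records, now, stale_secs):
    """Return (still_running, to_requeue) from persisted records."""
    running, requeue = [], []
    for rec in llog_records:
        if now - rec["last_status_update_time"] > stale_secs:
            requeue.append(rec)   # agent presumed dead: abort and redo
        else:
            running.append(rec)   # copytool will report in shortly
    return running, requeue
```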
&lt;br /&gt;
=== Client crash ===&lt;br /&gt;
# Client stops communicating with MDT&lt;br /&gt;
# MDT evicts client&lt;br /&gt;
# Eviction triggers coordinator to re-dispatch immediately all of the migrations from that agent&lt;br /&gt;
# For copyin, it is desirable that any existing agent I/O is stopped&lt;br /&gt;
## Ghost client and copytool may still be alive and communicating with OSTs, but not MDT.  Can&#039;t send abort.&lt;br /&gt;
## Taking file extent locks will only temporarily stop ghost.&lt;br /&gt;
## It&#039;s not so bad if new agent and ghost are racing trying to copyin the file at the same time.&lt;br /&gt;
### Regular extent locks prevent file corruption&lt;br /&gt;
### The file data being copied in is the same&lt;br /&gt;
### Ghost copyin may still be ongoing after the new copyin has finished, in which case the ghost may overwrite newly-modified data (data modified by regular clients after HSM/Lustre think the copyin is complete).&lt;br /&gt;
&lt;br /&gt;
=== Copytool crash ===&lt;br /&gt;
Copytool crash is different from a client crash, since the client will not be evicted.&lt;br /&gt;
# Copytool crashes&lt;br /&gt;
# Coordinator notices no status updates&lt;br /&gt;
# Coordinator sends abort signal to old agent&lt;br /&gt;
# Coordinator re-dispatches migration&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Implementation constraints ==&lt;br /&gt;
&lt;br /&gt;
# all single-file coherency issues are in kernel space (file locking, recovery)&lt;br /&gt;
# all policy decisions are in user space (using changelogs, df, etc)&lt;br /&gt;
# coordinator/mover communication will use LNET&lt;br /&gt;
# Version 1 HSM is a simplified implementation:&lt;br /&gt;
## integration with HPSS only&lt;br /&gt;
## depends on changelog for policy decisions&lt;br /&gt;
## restore on file open, not data read/write&lt;br /&gt;
# HSM tracks entire files, not stripe objects&lt;br /&gt;
# HSM namespace is flat, all files are addressed by FID only&lt;br /&gt;
# Coordinator and movers can be reused by (non-HSM) replication &lt;br /&gt;
&lt;br /&gt;
== HSM Migration components &amp;amp; interactions ==&lt;br /&gt;
Note: for V1, copyin initiators are on MDT only (file open).&lt;br /&gt;
[[Image:hsm_migration.png]]&lt;br /&gt;
&lt;br /&gt;
== For further review/detail ==&lt;br /&gt;
&lt;br /&gt;
# &amp;quot;complex&amp;quot; HSM roadmap&lt;br /&gt;
## partial access to files during restore&lt;br /&gt;
## partial purging for file type identification, image thumbnails, ??&lt;br /&gt;
## integration with other HSM backends (ADM, ??)&lt;br /&gt;
# How can layout locks be held in liblustre?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[[Category:Architecture|HSM migration]]&lt;br /&gt;
&lt;br /&gt;
[https://bugzilla.lustre.org/show_bug.cgi?id=15599 HSM implementation 15599]&lt;br /&gt;
[https://bugzilla.lustre.org/show_bug.cgi?id=15699 changelogs 15699]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Main_Page&amp;diff=4577</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Main_Page&amp;diff=4577"/>
		<updated>2008-05-06T18:06:01Z</updated>

		<summary type="html">&lt;p&gt;Nathan: fix mailing list link&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== What is Lustre? ==&lt;br /&gt;
&lt;br /&gt;
Lustre is a scalable, secure, robust, highly-available cluster file system. It is designed, developed and maintained by Sun Microsystems, Inc.&lt;br /&gt;
&lt;br /&gt;
The central goal is the development of a next-generation cluster file system which can serve clusters with tens of thousands of nodes, provide petabytes of storage, and move hundreds of GB/sec with state-of-the-art security and management infrastructure.&lt;br /&gt;
&lt;br /&gt;
Lustre runs on many of the largest Linux clusters in the world, and is included by Sun&#039;s partners as a core component of their cluster offerings (examples include HP StorageWorks SFS, and the Cray XT3 and XD1 supercomputers). Today&#039;s users have also demonstrated that Lustre scales down as well as it scales up, running in production on clusters as small as 4 nodes and as large as 25,000 nodes.&lt;br /&gt;
&lt;br /&gt;
The latest version of Lustre is always available from Sun Microsystems, Inc. Public Open Source releases of Lustre are available under the GNU General Public License. These releases are found here, and are used in production supercomputing environments worldwide.&lt;br /&gt;
&lt;br /&gt;
To be informed of Lustre releases, subscribe to the [http://wiki.lustre.org/index.php?title=Mailing_Lists lustre-announce] mailing list.&lt;br /&gt;
&lt;br /&gt;
Lustre development would not have been possible without funding and guidance from many organizations, including several U.S. National Laboratories, early adopters, and product partners.&lt;br /&gt;
&lt;br /&gt;
== User Resources == &lt;br /&gt;
&lt;br /&gt;
* [http://www.sun.com/software/products/lustre/get.jsp Lustre Downloads]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_Quick_Start Lustre Quick Start]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Mailing_Lists Mailing Lists]&lt;br /&gt;
* [http://manual.lustre.org/index.php?title=Main_Page Lustre Operations Manual]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Bug_Filing Filing Bugs]&lt;br /&gt;
* [https://bugzilla.lustre.org/showdependencytree.cgi?id=2374 Lustre Knowledge Base]&lt;br /&gt;
&lt;br /&gt;
== Advanced User Resources == &lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=BuildLustre How to build Lustre]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Kerb_Lustre Kerberos]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=LustreTuning Lustre Tuning]&lt;br /&gt;
* [http://wiki.lustre.org/images/7/78/LustreManual.html#Chapter_III-2._LustreProc LustreProc] - A guide to the proc tunable parameters for Lustre and their usage. It describes several of the proc tunables, including those that affect the client&#039;s RPC behavior, and prepares for a substantial reorganization of proc entries.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=LibLustre_HowTo Liblustre HowTo]&lt;br /&gt;
&lt;br /&gt;
== Lustre Centres of Excellence™ ==&lt;br /&gt;
&lt;br /&gt;
* [http://ornl-lce.clusterfs.com/index.php?title=Main_Page ORNL]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/cea-lce CEA]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/llnl-lce LLNL]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/psc-lce/index.php?title=Main_Page PSC]&lt;br /&gt;
* [http://www.clusterfs-mwiki.com/tsinghua-lce Tsinghua]&lt;br /&gt;
&lt;br /&gt;
== Developer Resources ==&lt;br /&gt;
* [http://arch.lustre.org Lustre Architecture]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Contribution_Policy Contribution Policy]&lt;br /&gt;
* [http://lists.lustre.org/mailman/listinfo Developer Mailing List]&lt;br /&gt;
* CVS usage&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Open_CVS CVS access to Lustre Source]&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Cvs_Branches CVS Branches] - How to manage branches with CVS.&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Cvs_Tips CVS Tips] - Helpful things to know while using Lustre CVS.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_Debugging Debugging Lustre] - A guide to debugging Lustre.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=ZFS_Resources ZFS Resources] - Learn about ZFS.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Coding_Guidelines Coding Guidelines] - Developer guidelines to avoid problems during Lustre code merges.&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Documenting_Code Documenting Code with Doxygen]&lt;br /&gt;
&lt;br /&gt;
== CFS Development Projects ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=IOPerformanceProject I/O Performance]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU Lustre OSS/MDS with ZFS DMU]&lt;br /&gt;
&lt;br /&gt;
== Community Development Projects ==&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Networking_Development Networking Development]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Diskless_Booting Diskless Booting]&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Drbd_And_Lustre DRBD and Lustre]&lt;br /&gt;
* [http://www.bullopensource.org/lustre Bull- Open Source tools for Lustre]&lt;br /&gt;
* [http://www.sourceforge.net/projects/lmt LLNL- Lustre Monitoring Tool]&lt;br /&gt;
&lt;br /&gt;
== Other Resources ==&lt;br /&gt;
&lt;br /&gt;
* [http://wiki.lustre.org/index.php?title=Lustre_Publications Lustre Publications] - Papers and presentations about Lustre&lt;br /&gt;
* Lustre User Group&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Lug_08 &#039;&#039;&#039;VIDEOS&#039;&#039;&#039;] and [http://picasaweb.google.com/overheardinpdx/LustreUserGroup2008 PHOTOS] from LUG2008 are now posted.&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Lug_08 Lustre User Group 2008]&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Lug_07 Lustre User Group 2007]&lt;br /&gt;
** [http://wiki.lustre.org/index.php?title=Lug_06 Lustre User Group 2006]&lt;br /&gt;
** LUG Requirements Forum - [http://wiki.lustre.org/images/7/78/LUG-Requirements-060420-final.pdf LUG-Requirements-060420-final.pdf] | [http://wiki.lustre.org/images/7/78/LUG-Requirements-060420-final.xls LUG-Requirements-060420-final.xls]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Architecture_-_Feature_FS_Replication&amp;diff=10448</id>
		<title>Architecture - Feature FS Replication</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Architecture_-_Feature_FS_Replication&amp;diff=10448"/>
		<updated>2008-04-09T21:27:04Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* Exported Interfaces */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
&lt;br /&gt;
This article describes a feature to facilitate efficient replication of large Lustre filesystems.  Target filesystems may be Lustre or any other.  This article does not address replication in the presence of clustered metadata.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
&lt;br /&gt;
# The solution must scale to large file systems and must avoid full file system scans&lt;br /&gt;
# The algorithm must be exact: no modified file may be missed&lt;br /&gt;
# If the file system is static while replication is performed, the target will equal the source, barring errors arising during the synchronization&lt;br /&gt;
# The algorithm is safe when run repeatedly or after an aborted attempt, and leads to understandable results when applied to a file system that is being modified during synchronization&lt;br /&gt;
# The solution will use a list of modified files for synchronization&lt;br /&gt;
# The solution should have a suitable architecture to synchronize flash caches or object replicas&lt;br /&gt;
# The architecture of this solution will be suitable for (i) ldiskfs (ii) Lustre 1.8 (ldiskfs with MDS&amp;amp;OSS) (iii) Lustre 1.10 (new MDS with OSD based fids)&lt;br /&gt;
# The solution will address future changes of log record formats (since these will contain rollback &amp;amp; audit information also in due course)&lt;br /&gt;
# The solution may initially only work on file systems without hard links (regular files with link count &amp;gt; 1)&lt;br /&gt;
# The synchronization mechanism must provide a facility to switch the roles of source and target, to perform failover and failback of services&lt;br /&gt;
# The solution must be able to deal with different (future) record formats&lt;br /&gt;
# The solution must provide for reverse replication for the recovery case&lt;br /&gt;
&lt;br /&gt;
== Critical use cases ==&lt;br /&gt;
&lt;br /&gt;
{| border=1&lt;br /&gt;
| &#039;&#039;&#039;identifier&#039;&#039;&#039; || &#039;&#039;&#039;attribute&#039;&#039;&#039; || &#039;&#039;&#039;summary&#039;&#039;&#039; &lt;br /&gt;
|-&lt;br /&gt;
| restart ||availability || Metadata synchronization is aborted and restarted. The outcome needs to be correct.&lt;br /&gt;
|-&lt;br /&gt;
| file system full || correctness || Replication logs can cause the file system to be full.  Correct messaging to user space is included in the solution.&lt;br /&gt;
|-&lt;br /&gt;
| MDT/OST log sync || correctness || OST records new file creation in log, but the event took place after the last MDT log sync &lt;br /&gt;
|-&lt;br /&gt;
| reverse replication || correctness || A master filesystem that is restarted after a failover to a backup filesystem must be made consistent with the backup &lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== restart ===&lt;br /&gt;
# Namespace changes may not be repeated (e.g. rm a, mv b a)&lt;br /&gt;
# Rename operations may be half-finished on target (e.g. rename, but haven&#039;t updated mtime of parent dir yet when power fails, so mtime is &#039;now&#039; on target, but should be &#039;rename time&#039; instead)&lt;br /&gt;
&lt;br /&gt;
=== MDT/OST log sync ===&lt;br /&gt;
Correctly deal with the following cases after a namespace log sync (epoch 1):&lt;br /&gt;
# New file is created&lt;br /&gt;
#: Ignore OST updates to files that were created after the epoch. The creation will be noted in the next MDT epoch, at which point the entire file (data and md) must be copied.  &lt;br /&gt;
#* Sync namespace epoch 1&lt;br /&gt;
#* (data record a) Modify file foo, ino=1&lt;br /&gt;
#* (namespace record b) mv foo bar&lt;br /&gt;
#* (namespace record c) Create file foo, ino=2&lt;br /&gt;
#* (data record d) Modify file foo, ino=2&lt;br /&gt;
#* Sync data&lt;br /&gt;
#** (record a) lookup destname of ino=1: it is foo, so copy ino=1 to dest:/foo&lt;br /&gt;
#** (record d) lookup destname of ino=2: when computing destname, we determine it did not exist at epoch 1 (we see create record in the active namespace log); return a special code and don&#039;t sync this file. Old &amp;quot;foo&amp;quot; on target is not modified.&lt;br /&gt;
#* Sync namespace epoch 2&lt;br /&gt;
#** (record b) move dest:/foo to dest:/bar&lt;br /&gt;
#** (record c) create dest:/foo; copy ino=2 to dest:/foo &lt;br /&gt;
# File is deleted&lt;br /&gt;
#: Leave the old version on dest alone.  File will be deleted on dest in epoch 2; any recent data changes made on the source before the delete will be lost.&lt;br /&gt;
#* See Solution Limitation A&lt;br /&gt;
# File moved off of OST (archive, space balance, ?)&lt;br /&gt;
#* for file-level replication, this is a non-event&lt;br /&gt;
#* for object-level replication, ?? leave special redirect record stub file&lt;br /&gt;
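The rule for case 1 above (ignore OST data updates for files created after the last-synced namespace epoch) can be sketched as a filter over data records. The record format below is an illustrative assumption.&lt;br /&gt;

```python
# Sketch: skip data sync for inodes whose create record is still in the
# active (unsynced) namespace log, i.e. files that did not exist at the
# last namespace epoch.  Record format is an assumption.
def should_sync(ino, active_ns_log):
    created = set(rec["ino"] for rec in active_ns_log if rec["op"] == "create")
    return ino not in created
```

In the walk-through above, data record (d) for ino=2 is skipped because record (c) created ino=2 after epoch 1, while record (a) for ino=1 is synced normally.&lt;br /&gt;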
&lt;br /&gt;
=== Reverse Replication ===&lt;br /&gt;
Filesystem A is &amp;quot;failed over&amp;quot; to a backup system B that was current as of some epoch.  When failing back, updates from B must be replicated on A so that A is consistent with B.&lt;br /&gt;
&lt;br /&gt;
FIXME: We need to decide between the following options.&lt;br /&gt;
Changes on A after the epoch should be either:&lt;br /&gt;
# reverted&lt;br /&gt;
## deleted files can be resurrected from B&lt;br /&gt;
## created files should be deleted&lt;br /&gt;
## namespace moves are undone&lt;br /&gt;
## ? files with mtimes after the epoch are recopied from B&lt;br /&gt;
## this assumes we have complete up-to-date changelogs&lt;br /&gt;
# kept, except in the case of conflicts&lt;br /&gt;
## conflict policy: new master always wins&lt;br /&gt;
## conflict policy: latest update wins&lt;br /&gt;
## conflict policy: ? something else&lt;br /&gt;
## changes since the epoch must be replicated back to B as well&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;Source File System&#039;&#039;&#039; : The file system that is being changed and that we want to replicate.&lt;br /&gt;
; &#039;&#039;&#039;Target File System&#039;&#039;&#039; : The file system that we wish to update to be identical to the source file system.&lt;br /&gt;
; &#039;&#039;&#039;Parent Tracker&#039;&#039;&#039; : a subsystem responsible for recording the parent inode in disk file system inodes, for pathname reconstruction. &lt;br /&gt;
; &#039;&#039;&#039;Changelog Generator&#039;&#039;&#039; : a subsystem used by the MDS and OSS systems to generate changelog entries transactionally when the file system changes.&lt;br /&gt;
; &#039;&#039;&#039;Changelog Notifier&#039;&#039;&#039; : subsystem responsible for notifying that a certain number of entries have been made, OR a certain amount of time has elapsed since changelog entries were made available, OR a large amount of changelog entries remains unprocessed.&lt;br /&gt;
; &#039;&#039;&#039;Changelog Consumer&#039;&#039;&#039; : subsystem reading changelog entries for further processing.&lt;br /&gt;
; &#039;&#039;&#039;File Data &amp;amp; Attribute Synchronizer&#039;&#039;&#039; : subsystem responsible for opening the correct source and target file and synchronizing the data/attributes in the file.&lt;br /&gt;
; &#039;&#039;&#039;Namespace Synchronizer&#039;&#039;&#039; : subsystem responsible for synchronizing the namespace.  This subsystem executes creation, deletion and rename operations.&lt;br /&gt;
; &#039;&#039;&#039;Layout Manager&#039;&#039;&#039; : changes MDS object layout information associated with the replicated file data.&lt;br /&gt;
; &#039;&#039;&#039;Namespace Operation&#039;&#039;&#039; : creation (open with create, mknod, symlink, mkdir, link), deletion (unlink, rmdir) or change (rename) of a name&lt;br /&gt;
; &#039;&#039;&#039;Replication Epoch&#039;&#039;&#039; : A sequence of namespace operations, a set of inodes with attribute and file data changes, bracketed by an initial and final time and record number.&lt;br /&gt;
; &#039;&#039;&#039;Active log&#039;&#039;&#039; : The changelog to which the filesystem is currently appending change records.&lt;br /&gt;
; &#039;&#039;&#039;Staged log&#039;&#039;&#039; : One of zero or more &#039;closed&#039; logs, no longer active. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Components ==&lt;br /&gt;
[[Image:components.png|600px]]&lt;br /&gt;
&lt;br /&gt;
[[Image:deployments.png|600px]]&lt;br /&gt;
=== Parent Tracker ===&lt;br /&gt;
&lt;br /&gt;
==== Uses relationships ====&lt;br /&gt;
; Used for : current full path lookup given an inode or fid&lt;br /&gt;
; Requires : the disk file system extended attribute interfaces to record primary parent inode.  An indexed file to record pairs of inodes and secondary parents.&lt;br /&gt;
&lt;br /&gt;
==== Logic ====&lt;br /&gt;
&lt;br /&gt;
* When a new inode is created (creat, mknod, mkdir, open(O_CREAT), symlink, (link)), the parent directory inode number (inum) is automatically added as an EA. On Lustre, since OST inodes are precreated, we will modify the EA in filter_update_fidea.  This means that an EA update is now required for these operations, which may change the transaction size by one block.&lt;br /&gt;
&lt;br /&gt;
* Upon rename the parent inode is changed.&lt;br /&gt;
&lt;br /&gt;
* The initial implementation may ignore multiple parents (hardlinks); for replication purposes synchronizing any one of the hardlinks is sufficient.&lt;br /&gt;
&lt;br /&gt;
; modifications : obdfilter (filter_update_fidea), ldiskfs (ldiskfs_new_inode)&lt;br /&gt;
&lt;br /&gt;
=== Changelog Generator ===&lt;br /&gt;
&lt;br /&gt;
==== Uses relationships ====&lt;br /&gt;
; Used for : The disk file system on MDS or OSS to record changes in the file system.&lt;br /&gt;
; Requires : llog or another transactional, efficient logging mechanism to record changes, and the file system or VFS APIs to create logs&lt;br /&gt;
&lt;br /&gt;
==== Exported Interfaces ====&lt;br /&gt;
# changelog accessor&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===== Changelog Accessor =====&lt;br /&gt;
Changelogs are stored in a hidden directory at the filesystem root (/.changelog/).  They are files (mode 440, user &amp;quot;audit&amp;quot;) generated automatically by the filesystem when a special mount option is supplied (-o changelog).  Only completed (&amp;quot;staged&amp;quot;) log files are visible in the directory. Files are named with their changelog file sequence number.  The current sequence number is stored on disk in a separate file (/.changelog/sequence).&lt;br /&gt;
&lt;br /&gt;
; modifications : ldiskfs; mdt/ost to serve special non-MDT-managed OST files to Lustre clients; mdt to serve special files directly to Lustre clients without using OSTs&lt;br /&gt;
&lt;br /&gt;
==== Interactions ====&lt;br /&gt;
&lt;br /&gt;
[[Image:changelog.png|500px]]&lt;br /&gt;
&lt;br /&gt;
==== Features and Limitations ====&lt;br /&gt;
* Log records are placed into a special changelog file.&lt;br /&gt;
* The records contain fids of existing objects and names of new objects&lt;br /&gt;
* The records contain the parent fid&lt;br /&gt;
* Changelog records are recorded within the filesystem transaction&lt;br /&gt;
* Modified files are noted in the changelog log only once (not at every transaction)&lt;br /&gt;
* Every instance of a namespace change operation (rename, create, delete) is recorded in the log&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Theorem:&#039;&#039;&#039; Given a changelog with the properties described above a correct synchronization can be performed.&lt;br /&gt;
&#039;&#039;&#039;Proof:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
==== Logic ====&lt;br /&gt;
# Global changelog_active_start_time is set to the active changelog&#039;s crtime at startup and every time a new active changelog is created.&lt;br /&gt;
# When an inode is first read from disk (or created), save the old mtime, so we can easily detect when it changes&lt;br /&gt;
# Every time an inode is dirtied, check old_mtime.  &lt;br /&gt;
## If old_mtime is after changelog_active_start_time, then inode has already been added to changelog; don&#039;t add it again.  &lt;br /&gt;
## If old_mtime is earlier, then add the change record to the active changelog&lt;br /&gt;
## Update old_mtime to current mtime. If mtime is not later than changelog_active_start_time, we can use ctime, dtime, or current time instead of mtime; it just needs to be a time after changelog_active_start_time.&lt;br /&gt;
# Record every namespace change operation in the log for path reconstruction of a live filesystem.  &#039;&#039;Record every instance&#039;&#039;, not just once per log.  To simplify the log format, move/rename operations may be broken up into two log records (the first with the old name/parent, the second with the new).&lt;br /&gt;
## Include full path information in these records&lt;br /&gt;
&lt;br /&gt;
; modifications : ldiskfs (ldiskfs_mark_inode_dirty, ldiskfs_new_inode, ldiskfs_link, ldiskfs_unlink, ldiskfs_rmdir, ldiskfs_rename, ldiskfs_ioctl, ldiskfs_truncate, ldiskfs_xattr_set, ldiskfs_setattr)&lt;br /&gt;
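The record-once-per-log rule in steps 1-3 can be modeled as follows; ChangelogState and the inode dictionary are illustrative stand-ins for the in-memory ldiskfs inode, not real symbols.&lt;br /&gt;

```python
# Minimal model of the record-once-per-active-log rule; names are
# illustrative, not actual ldiskfs symbols.
class ChangelogState:
    def __init__(self, active_start_time):
        self.active_start_time = active_start_time  # crtime of active log
        self.records = []

    def inode_dirtied(self, inode, now):
        # Step 3.1: old_mtime after the log start means the inode was
        # already recorded in this log; do not add it again.
        if inode["old_mtime"] > self.active_start_time:
            return False
        self.records.append(inode["ino"])           # step 3.2: add record
        # Step 3.3: stamp old_mtime with a time after active_start_time.
        inode["old_mtime"] = max(now, self.active_start_time + 1)
        return True
```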
&lt;br /&gt;
=== Pathname Lookup ===&lt;br /&gt;
&lt;br /&gt;
==== Uses relationships ====&lt;br /&gt;
; Used for : Data Replicator&lt;br /&gt;
; Requires : Parent Tracker, Changelog Accessor, ability to temporarily lock filesystem against namespace changes&lt;br /&gt;
&lt;br /&gt;
==== Exported Interfaces ====&lt;br /&gt;
; source_path(ino, *sourcepath, *end_rec)&lt;br /&gt;
: ino : inode or fid from data changelog record&lt;br /&gt;
: sourcepath : current path on source, as of end_rec&lt;br /&gt;
: end_rec : latest namespace record number&lt;br /&gt;
* may return ENOENT: file has been deleted from source&lt;br /&gt;
&lt;br /&gt;
; target_path(sourcepath, end_rec, namespace_log, *targetpath)&lt;br /&gt;
: namespace_log : the name of the namespace log that we have last synced on the target&lt;br /&gt;
: targetpath : the path for this inode on the target system, as of the end of namespace_log&lt;br /&gt;
* may return ENOENT: file does not exist (yet) on target&lt;br /&gt;
&lt;br /&gt;
==== Features and Limitations ====&lt;br /&gt;
* Lives on MDT, exported interface to Lustre clients&lt;br /&gt;
&lt;br /&gt;
==== Logic ====&lt;br /&gt;
source_path&lt;br /&gt;
# lock filesystem against renames&lt;br /&gt;
# lookup current full path name on MDT from ino/fid&lt;br /&gt;
#* open inode.  Return ENOENT if doesn&#039;t exist&lt;br /&gt;
#* open parent ino&#039;s as stored in EA&lt;br /&gt;
#* likely that path elements will not be cached on the MDT during this lookup&lt;br /&gt;
# return full path in sourcepath&lt;br /&gt;
# return last record number in the active namespace changelog in end_rec&lt;br /&gt;
# unlock filesystem against renames&lt;br /&gt;
&lt;br /&gt;
target_path&lt;br /&gt;
# generate a list of parent path elements from sourcepath (names or inodes)&lt;br /&gt;
# search backward through the namespace logs (active and staged) from end_rec, replacing any renamed path elements with their old (previous) version, until either:&lt;br /&gt;
## we reach the first record after end of the given namespace_log&lt;br /&gt;
## a create record for this inode is encountered; return ENOENT&lt;br /&gt;
# return targetpath&lt;br /&gt;
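The backward search in target_path() can be sketched as follows; the (recno, op, old_name, new_name) record format is an assumption for illustration.&lt;br /&gt;

```python
# Walk the namespace log backward from end_rec, undoing renames on the
# path elements, until we reach records already synced to the target.
# Returns None for the ENOENT case (file not yet created on target).
def target_path(sourcepath, end_rec, records, synced_rec):
    path = sourcepath.split("/")
    for recno, op, old_name, new_name in reversed(records):
        if recno > end_rec:
            continue                    # record newer than our snapshot
        if synced_rec >= recno:
            break                       # target already reflects this record
        if op == "create" and new_name in path:
            return None                 # created after last target sync
        if op == "rename" and new_name in path:
            path[path.index(new_name)] = old_name   # undo the rename
    return "/".join(path)
```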
&lt;br /&gt;
; Hardlinks&lt;br /&gt;
: source_path() will always return a single path.  The path need not be consistent.  target_path() will always return a single valid path on the target to one of the hardlinks (renames along this &amp;quot;chosen&amp;quot; path will be undone.)  Renames along any other hardlinked paths may be ignored: target_path is used to update &#039;&#039;file data and attributes&#039;&#039;, which are shared between all the hardlinked files on the target.  The renames of the other hardlinked paths themselves are synchronized by the Namespace Synchronizer.&lt;br /&gt;
&lt;br /&gt;
=== Changelog Notifier ===&lt;br /&gt;
&lt;br /&gt;
==== Uses relationships ====&lt;br /&gt;
; Used for : The disk file system, to notify the Changelog Consumer of available data&lt;br /&gt;
; Requires : Changelog Generator&lt;br /&gt;
&lt;br /&gt;
==== Exported Interfaces ====&lt;br /&gt;
&lt;br /&gt;
* notification of new or excessive changelog&lt;br /&gt;
** updates sequence file&lt;br /&gt;
&lt;br /&gt;
==== Features and Limitations ====&lt;br /&gt;
&lt;br /&gt;
* When an active changelog has reached a preset limit of size, record count, or active time, the notifier:&lt;br /&gt;
*# increases the sequence number (changelog identifier)&lt;br /&gt;
*# creates a new active changelog with the new identifier&lt;br /&gt;
*# marks the old changelog as staged (unhidden).  &lt;br /&gt;
* A Changelog Consumer could poll the sequence file for mtime changes, signalling a new staged file.&lt;br /&gt;
&lt;br /&gt;
==== Logic ====&lt;br /&gt;
Multiple logs are kept on each disk file system volume: zero or more non-active (&amp;quot;staged&amp;quot;) logs and exactly one active log.  Staged logs are stored in a special directory (/.changelog) and are named with a monotonically increasing sequence number. The sequence number of the latest staged log is stored on disk in a separate file (/.changelog/sequence).&lt;br /&gt;
&lt;br /&gt;
# Begin recording into the active log.  We append to this file until some trigger criteria is met (either time or log size).  The active log is not visible to filesystem users.&lt;br /&gt;
# When the trigger criteria is met&lt;br /&gt;
## stop writing into the active log&lt;br /&gt;
## create a new active log&lt;br /&gt;
## thenceforth record all subsequent changes into the new active log&lt;br /&gt;
## indicate that the old log is now staged by marking it as visible and updating the sequence file&lt;br /&gt;
# When a user process has finished processing the files referred to in the staged log (including synchronizing the files remotely), it signals its completion by deleting the staged log (which may act as part of a new trigger criteria.)  &lt;br /&gt;
# The cycle begins again from step 2.  &lt;br /&gt;
# If the number of staged logs exceeds some threshold, the Notifier records a warning in the syslog (D_WARNING)&lt;br /&gt;
&lt;br /&gt;
A user process is signaled that a new changelog is ready by polling the sequence file mtime.&lt;br /&gt;
&lt;br /&gt;
If the user process dies, upon restart it re-reads the staged log, perhaps repeating old actions (sync).  If the server dies, upon restart it continues recording into the active log.&lt;br /&gt;
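The staging cycle above can be sketched with a toy in-memory active log and an on-disk staging directory; this is an assumption-laden illustration, not ldiskfs code.&lt;br /&gt;

```python
import os

# Toy model of the notifier's staging cycle.  File names, the record
# limit, and the in-memory active log are illustrative assumptions.
class Notifier:
    def __init__(self, root, max_records=3):
        self.root = root
        self.max_records = max_records
        self.seq = 0
        self.active = []          # stand-in for the hidden active log

    def append(self, record):
        self.active.append(record)
        if len(self.active) >= self.max_records:
            self.stage()          # trigger criterion met

    def stage(self):
        self.seq += 1                             # bump sequence number
        staged = os.path.join(self.root, str(self.seq))
        with open(staged, "w") as f:              # make old log visible
            f.write("\n".join(self.active))
        with open(os.path.join(self.root, "sequence"), "w") as f:
            f.write(str(self.seq))                # consumers poll this mtime
        self.active = []                          # new empty active log
```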
&lt;br /&gt;
&lt;br /&gt;
; modifications : ldiskfs&lt;br /&gt;
&lt;br /&gt;
=== Changelog Consumer ===&lt;br /&gt;
&lt;br /&gt;
[[Image:sync.png|500px]]&lt;br /&gt;
&lt;br /&gt;
==== Uses relationships ====&lt;br /&gt;
; Used for : Managing filesystem replication&lt;br /&gt;
; Requires : Namespace Synchronizer, File Data &amp;amp; Attribute Synchronizer, Changelog Notifier&lt;br /&gt;
&lt;br /&gt;
==== Exported Interfaces ====&lt;br /&gt;
; replicate(lustre_root_path, target_root_path, [implicit target sync file])&lt;br /&gt;
: lustre_root_path : path to root of Lustre source fs&lt;br /&gt;
: target_root_path : path to root of target (must be locally mounted!) &lt;br /&gt;
&lt;br /&gt;
The last namespace log record processed is stored in a sync file on the target. This file is read at the beginning of replicate() to determine the next start record.&lt;br /&gt;
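A sketch of how replicate() might read and update the sync file on the target; the .sync file name is an illustrative assumption.&lt;br /&gt;

```python
import os

# Hypothetical helpers for the per-target sync file that records the
# last namespace log record applied to that replica.
def read_sync_point(target_root):
    sync_file = os.path.join(target_root, ".sync")
    if os.path.exists(sync_file):
        with open(sync_file) as f:
            return int(f.read())
    return 0    # target has never been synced; start from the beginning

def write_sync_point(target_root, recno):
    with open(os.path.join(target_root, ".sync"), "w") as f:
        f.write(str(recno))
```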
&lt;br /&gt;
==== Features and Limitations ====&lt;br /&gt;
&lt;br /&gt;
The consumer uses the changelogs to coordinate filesystem replication using the Namespace Synchronizer and the File Data Synchronizer described below. &lt;br /&gt;
&lt;br /&gt;
* Multiple replica filesystems may exist at different sync points.&lt;br /&gt;
&lt;br /&gt;
==== Logic ==== &lt;br /&gt;
A synchronization cycle consists of synchronizing the namespace and file data for each changed file/dir.  Special care is needed because a file/dir may be renamed at any time during the synchronization cycle, changing the path name resolution.&lt;br /&gt;
&lt;br /&gt;
Synchronization for Lustre requires both OST and MDT logs, with path lookup on the MDT for any files changed on the OSTs.&lt;br /&gt;
&lt;br /&gt;
# Synchronize the namespace change operations from a staged log using the Namespace Synchronizer (MDT)&lt;br /&gt;
#* Now destination namespace mirrors source, but file contents/md may not match&lt;br /&gt;
# For all other ops (md and data) in the staged log(s on the MDT and all OSTs), call the Data/MD Synchronizer&lt;br /&gt;
# Data/MD synchronization should be carried out in parallel for each OST and the MDT.&lt;br /&gt;
#* Now destination filesystem namespace matches, but data/md may be newer than a theoretical snapshot of the source taken at Namespace Sync time.&lt;br /&gt;
&lt;br /&gt;
; modifications : new userspace utility&lt;br /&gt;
&lt;br /&gt;
=== File Data &amp;amp; Attribute Synchronizer ===&lt;br /&gt;
&lt;br /&gt;
==== Uses relationships ====&lt;br /&gt;
; Used for : Changelog Consumer - data replication&lt;br /&gt;
; Requires : Changelog Accessor, Path Lookup, VFS api to open file by ino/fid, remote filesystem access&lt;br /&gt;
&lt;br /&gt;
==== Features and Limitations ====&lt;br /&gt;
* Multiple data change logs exist and these can be synchronized in parallel.&lt;br /&gt;
* Individual changes are NOT recorded; only the fact that an inode is dirty.  The data and metadata of the file on the target filesystem will match the source &#039;&#039;&#039;at the time of the copy&#039;&#039;&#039;, not as of the namespace sync. In other words, the target is &#039;&#039;&#039;not&#039;&#039;&#039; an exact snapshot of the source filesystem at a single point.&lt;br /&gt;
* Open-by-fid requires root privileges&lt;br /&gt;
&lt;br /&gt;
==== Logic ====&lt;br /&gt;
; &#039;&#039;&#039;Find pathnames in the destination&#039;&#039;&#039; : Data records are recorded with inode/fids on the source.  To transform this into an operation that can be applied to the destination file system we find the target file name using:&lt;br /&gt;
# read data record&lt;br /&gt;
# source_path(ino, *sourcepath, *end_rec)&lt;br /&gt;
#* if ENOENT: file has been deleted from source; don&#039;t synchronize&lt;br /&gt;
# target_path(sourcepath, end_rec, namespace_change_log, *targetpath)&lt;br /&gt;
#* if ENOENT: file does not exist (yet) on target; don&#039;t synchronize&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;Synchronize file data and attributes&#039;&#039;&#039; : &lt;br /&gt;
# iopen(ino) on source for reading&lt;br /&gt;
#* requires root privileges on the source&lt;br /&gt;
# open(targetpath) on target for writing&lt;br /&gt;
# copy file data and attributes from source to target&lt;br /&gt;
#* may require temporary access permission changes on the target&lt;br /&gt;
#* ownership changes may require root privileges on the target&lt;br /&gt;
#* data copy may be predicated on mtime change, checksum, etc.  May use rsync as a primitive here.&lt;br /&gt;
#* data changes should be made before attribute changes (mtime), in case of power failure&lt;br /&gt;
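The copy step can be sketched for plain local paths (open-by-inode, temporary permission changes, and other Lustre specifics are elided; the mtime-based skip rule is one of the predicates suggested above):&lt;br /&gt;

```python
import os
import shutil

# Sketch of a single file sync step over plain local paths; real code
# would iopen() the source by inode and handle ownership separately.
def sync_file(source, target):
    src_stat = os.stat(source)
    if os.path.exists(target):
        if os.stat(target).st_mtime >= src_stat.st_mtime:
            return False       # target mtime not older; skip the copy
    shutil.copyfile(source, target)                           # data first...
    os.utime(target, (src_stat.st_atime, src_stat.st_mtime))  # ...then attrs
    return True
```

Setting attributes after the data copy preserves the power-failure ordering noted above: an interrupted copy leaves a stale mtime, so the file is recopied on the next pass.&lt;br /&gt;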
&lt;br /&gt;
; &#039;&#039;&#039;Parallel file processing&#039;&#039;&#039;&lt;br /&gt;
: if multiple OSTs (and/or the MDT) all note that a file needs to be synced, they may race.  Normal locking should ensure no inconsistent results, but copying the same file multiple times should be avoided for efficiency.  Simple rules like &amp;quot;don&#039;t copy if target mtime is later than source mtime&amp;quot; will help, but the open(target)s may still race after such checks.  Possible solutions include using a lockfile on the target (difficult to determine if the lockfile is still valid after a crash because of possible distributed lockers), interprocess communication on the sources, or perhaps a special &amp;quot;sync_in_progress&amp;quot; bit on the MDT inode set in the open-by-fid path.  FIXME: nonfatal, performance only&lt;br /&gt;
&lt;br /&gt;
; modifications : new userspace component (data syncer), ldiskfs/mdt (ioctl open by inode)&lt;br /&gt;
&lt;br /&gt;
=== Namespace Synchronizer ===&lt;br /&gt;
&lt;br /&gt;
==== Uses relationships ====&lt;br /&gt;
; Used for : Changelog Consumer - namespace replicator&lt;br /&gt;
; Requires : Changelog Accessor, remote filesystem access&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Logic ====&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;Path lookup at namespace change time&#039;&#039;&#039; :  Each record is written with the inode/fid and the full path (up to the filesystem root) on the source at the time of the change.  The path is likely to be cached at that point, so the lookup should be cheap.  For renames, both the old (mvFrom) and the new (mvTo) paths are recorded.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;Implementing a record&#039;&#039;&#039; : The metadata synchronizer processes all namespace change records in the changelog such that the target filesystem namespace matches the source namespace as of the close of this changelog, including the mtimes of the affected directories.&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;Finding the first valid record to implement&#039;&#039;&#039; : The synchronizer performs an ordered write on the target filesystem.  Before it makes the namespace changes to the target, it records the record number (transno) of the namespace record it is currently working on in a special file on the target, along with a full copy of the transaction record.  This yields a simple, well-defined redo procedure:&lt;br /&gt;
;* a deletion is repeated if the name is still there; if the name is not there but the mtime does not match the post-op mtime in the record, only the mtime is updated.&lt;br /&gt;
;* a creation is repeated if the name is not yet there.&lt;br /&gt;
;* a rename is repeated if the names or parent mtimes are not what they should be.&lt;br /&gt;
;* the full transaction record on the replica is used for recovery in case of master-to-replica failover.&lt;br /&gt;
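A minimal sketch of the idempotent redo check implied by these rules (record fields and helper names are hypothetical):&lt;br /&gt;

```python
import os

def redo_needed(rec, target_root):
    """Decide whether a journalled namespace record must be re-applied.

    rec is a dict with hypothetical fields: 'type' (record type) and
    'name' (pathname relative to the filesystem root).
    """
    path = os.path.join(target_root, rec["name"].lstrip("/"))
    if rec["type"] == "Unlink":
        # a deletion is repeated if the name is still there
        return os.path.lexists(path)
    if rec["type"] == "Create":
        # a creation is repeated if the name is not yet there
        return not os.path.lexists(path)
    if rec["type"] == "mvTo":
        # a rename is repeated if the new name is not there
        # (a full check would also compare parent mtimes)
        return not os.path.lexists(path)
    return False
```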
&lt;br /&gt;
&lt;br /&gt;
; modifications : new userspace utility&lt;br /&gt;
&lt;br /&gt;
== Solution Limitations ==&lt;br /&gt;
&lt;br /&gt;
Notable limitations to the proposed solution:&lt;br /&gt;
&lt;br /&gt;
A. A replicated file system is not a snapshot taken at a single point in time on a live filesystem.  It is a namespace snapshot, combined with data and metadata contents that may be &#039;&#039;&#039;later than&#039;&#039;&#039; the snapshot time.  In the case of deleted files, we may miss some data/md updates.&lt;br /&gt;
&lt;br /&gt;
B. Data and attribute synchronization may require temporary access permission changes on the target.&lt;br /&gt;
&lt;br /&gt;
C. Open-by-inode may require root permission on the source.&lt;br /&gt;
&lt;br /&gt;
D. A new record is not created for every data/md change; at most one per file per epoch is recorded.  (However, every namespace change is recorded.)  This may have implications for potential future audit or undo features.&lt;br /&gt;
&lt;br /&gt;
== States and Transitions ==&lt;br /&gt;
&lt;br /&gt;
=== Source File System ===&lt;br /&gt;
&lt;br /&gt;
* A directory for changelog files (per MDT/OST)&lt;br /&gt;
* Staged and active changelogs&lt;br /&gt;
** header: start record, start mtime&lt;br /&gt;
** records: transno, ino, parent ino, mtime (see Changelog Format above) &lt;br /&gt;
** footer: final record, final mtime, (future) ZFS snapshot name associated with the final state&lt;br /&gt;
* Sequence file&lt;br /&gt;
** active changelog filename&lt;br /&gt;
&lt;br /&gt;
==== Changelog Format ====&lt;br /&gt;
Changelogs are packed binary data files.  The first record is a header; subsequent records use the following structure:&lt;br /&gt;
* magic (__u32)&lt;br /&gt;
* flags (__u32)&lt;br /&gt;
* record_num (__u64) (transno)&lt;br /&gt;
* record_type (__u16)&lt;br /&gt;
* ino (long)&lt;br /&gt;
* parent_ino (long)&lt;br /&gt;
* namespace changes only:&lt;br /&gt;
** old mtime (struct timespec)&lt;br /&gt;
**: note: post-op mtime is in the next namespace change rec, or current mtime&lt;br /&gt;
** strlen (__u16)&lt;br /&gt;
** pathname (string)&lt;br /&gt;
**: note: for renames, &amp;quot;from&amp;quot; is in the mvFrom rec, &amp;quot;to&amp;quot; is in the mvTo rec&lt;br /&gt;
&lt;br /&gt;
; record_types : Create, Link, Unlink, Rmdir, mvFrom, mvTo, Ioctl, Trunc, Xattr, Setattr, *unknown&lt;br /&gt;
&lt;br /&gt;
The header record:&lt;br /&gt;
* magic (__u32)&lt;br /&gt;
* flags (__u32)&lt;br /&gt;
* start_record_num (__u64)&lt;br /&gt;
* changelog_file_seqno (long)&lt;br /&gt;
* log start time (struct timespec)&lt;br /&gt;
* strlen (__u16)&lt;br /&gt;
* note (string)&lt;br /&gt;
&lt;br /&gt;
Tail record:&lt;br /&gt;
* magic (__u32)&lt;br /&gt;
* flags (__u32)&lt;br /&gt;
* end_record_num (__u64)&lt;br /&gt;
* next_changelog_seqno (long)&lt;br /&gt;
* log end time (struct timespec)&lt;br /&gt;
* strlen (__u16)&lt;br /&gt;
* snapshot name (string) (future)&lt;br /&gt;
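The fixed-size part of a record can be illustrated with Python&#039;s struct module.  Field widths follow the lists above; &amp;quot;long&amp;quot; is taken as 64 bits here, and the magic value is invented for the example.&lt;br /&gt;

```python
import struct

# magic, flags, record_num, record_type, ino, parent_ino
# (little-endian, no padding; "long" assumed to be 64-bit)
FIXED = struct.Struct("<IIQHqq")
MAGIC = 0x10CC109A  # hypothetical magic value

def pack_record(record_num, record_type, ino, parent_ino, flags=0):
    return FIXED.pack(MAGIC, flags, record_num, record_type,
                      ino, parent_ino)

def unpack_record(buf):
    magic, flags, recno, rtype, ino, pino = FIXED.unpack_from(buf)
    assert magic == MAGIC, "bad changelog magic"
    return dict(record_num=recno, record_type=rtype,
                ino=ino, parent_ino=pino, flags=flags)
```

Namespace-change records would append the old mtime, strlen, and pathname after this fixed part.&lt;br /&gt;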
&lt;br /&gt;
=== Target file system ===&lt;br /&gt;
&lt;br /&gt;
* Last attempted namespace record (entire record)&lt;br /&gt;
* Last attempted data / metadata records for each replicator&lt;br /&gt;
*: optional, allows for faster recovery&lt;br /&gt;
** OST index&lt;br /&gt;
** replication start &lt;br /&gt;
** replication last completed&lt;br /&gt;
&lt;br /&gt;
=== State Changes in the Source File System ===&lt;br /&gt;
* Some file system transactions result in changelog records recorded in the active changelog&lt;br /&gt;
** file data or metadata modifications in files that have not already been added to the active changelog&lt;br /&gt;
** any namespace change operations&lt;br /&gt;
* Active changelogs are occasionally staged. Multiple staged logs may exist.&lt;br /&gt;
* The sequence file is updated to reflect the latest staged log. &lt;br /&gt;
* &amp;quot;Completed&amp;quot; changelogs may be deleted at will (after synchronization)&lt;br /&gt;
&lt;br /&gt;
[[image:source_state.png]]&lt;br /&gt;
&lt;br /&gt;
=== State Changes in the Target File System ===&lt;br /&gt;
&lt;br /&gt;
* Each time a namespace operation is attempted, the last attempted record is recorded.  This may also be done for attribute and file synchronization operations, for efficient restarts (note that this would require a separate file per replication &#039;chunk&#039;).&lt;br /&gt;
&lt;br /&gt;
* Namespace operations are applied to the target filesystem using user-level filesystem commands (mv, rm, mknod, touch) or the POSIX file API (open, creat, unlink, ioctl) to adjust file, directory, and parent directory names and metadata.&lt;br /&gt;
&lt;br /&gt;
* Data / metadata operations are applied to the target filesystem using user-level filesystem commands (cp, rsync, touch) or the POSIX file API (read, write, ioctl) to synchronize file content and metadata.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Related Applications ==&lt;br /&gt;
&lt;br /&gt;
The mechanism of using a list of fids of modified files for synchronization hinges on opening files by fid, or on correctly computing the pathname on a previously synced replica.  This mechanism has applications elsewhere in Lustre, which we discuss here.&lt;br /&gt;
&lt;br /&gt;
=== Flash Cache Sync demo ===&lt;br /&gt;
&lt;br /&gt;
[[Image:flash-cache-sync-demo.png|700px]]&lt;br /&gt;
&lt;br /&gt;
=== Fast Incremental Backup ===&lt;br /&gt;
&lt;br /&gt;
[[Image:incremental-backup.png|700px]]&lt;br /&gt;
&lt;br /&gt;
=== Space Balancing Migration ===&lt;br /&gt;
&lt;br /&gt;
[[Image:space-balancing-migration.png|700px]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Hadoop File System style Server Network Stripes ===&lt;br /&gt;
&lt;br /&gt;
[[Image:Hdfs-sns.png|700px]]&lt;br /&gt;
&lt;br /&gt;
== Alternatives ==&lt;br /&gt;
=== mtime on MDT ===&lt;br /&gt;
An always-current mtime on the MDT would allow us to record changed files on the MDT only (no OST logging).  Bug 11063 has an update-mtime-on-close patch, but this is unsafe in the case of a client crash.  We could use full SOM-style recovery to guarantee the mtime (backported from HEAD), but this may be very complicated given the MDT changes.  Arguably, the changelog we are creating on the OSS nodes is similar to the recovery mechanisms used by SOM.&lt;br /&gt;
&lt;br /&gt;
NOTE: on HEAD (with SOM), we need only record file (metadata) changes on the MDT, since every OST change will result in an updated mtime on the MDT inode.  This eliminates the logging requirement for replication on the OSTs, removing possible racing OST syncs.  The synchronizer on the MDT would be responsible for distributing chunks of the data synchronization to multiple clients and ensuring their completion.&lt;br /&gt;
&lt;br /&gt;
=== ZFS snapshots ===&lt;br /&gt;
By taking a snapshot on the MDT at the end of each epoch, the target pathname reconstruction step can be avoided.  &amp;quot;Current&amp;quot; pathname of the inode is looked up on the snapshot; this gives the target pathname directly.&lt;br /&gt;
&lt;br /&gt;
Potentially, if cluster-wide synchronized snapshots were available, then true snapshot backups could be made.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
[https://bugzilla.lustre.org/show_bug.cgi?id=14169 bug 14169]&lt;br /&gt;
[[Category:Feature|FS Replication]]&lt;br /&gt;
[[Category:Team_Rabbit]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== to be removed: HLD reqs ==&lt;br /&gt;
Entry Criteria&lt;br /&gt;
&lt;br /&gt;
You need to have on hand&lt;br /&gt;
    1. Architecture document&lt;br /&gt;
    2. Quality Attribute Scenarios &amp;amp; Use cases&lt;br /&gt;
    3. LOGT, LOGD&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
1. External Functional specifications&lt;br /&gt;
      Following the architecture, define prototypes for all externally&lt;br /&gt;
      visible interfaces (library functions, methods etc) of all&lt;br /&gt;
      modules changed by the program.  Be sufficiently detailed in the&lt;br /&gt;
      specification to cover:&lt;br /&gt;
      a. Layering of API&#039;s&lt;br /&gt;
      b. How memory for variables / parameters is allocated&lt;br /&gt;
      c. In what context the functions run&lt;br /&gt;
    &lt;br /&gt;
&lt;br /&gt;
2. High level logic&lt;br /&gt;
      Use high level pseudocode to indicate how all elements of the&lt;br /&gt;
      program will be implemented.&lt;br /&gt;
&lt;br /&gt;
3. Use case scenarios&lt;br /&gt;
    a. Write use cases for all normal and abnormal uses of externally&lt;br /&gt;
        visible functions.&lt;br /&gt;
    b. Write use cases demonstrating interoperability between the&lt;br /&gt;
        software with and without this module&lt;br /&gt;
    c. Write use cases demonstrating the scalability use cases&lt;br /&gt;
       mentioned in the architecture.&lt;br /&gt;
    d. Include use case scenarios for all locking situations and &lt;br /&gt;
       describe how likely they are.&lt;br /&gt;
&lt;br /&gt;
4. State machine design&lt;br /&gt;
    With great care describe state machines included, used or affected&lt;br /&gt;
    by the module.   Describe the transitions between states.   Be&lt;br /&gt;
    alert to the fact that any function called by the module can&lt;br /&gt;
    change a state machine in the environment and this may interact&lt;br /&gt;
    with state machines in your model.&lt;br /&gt;
&lt;br /&gt;
    Pay particular attention to:&lt;br /&gt;
    a. locking (ordering, lock/unlock)&lt;br /&gt;
    b. cache usage&lt;br /&gt;
    c. recovery (connect, disconnect, export/import/request state&lt;br /&gt;
       machines)&lt;br /&gt;
    d. disk state changes&lt;br /&gt;
&lt;br /&gt;
5. High Level Logic Design&lt;br /&gt;
    a. What synchronization primitives (lock types, etc.) need to be chosen &lt;br /&gt;
         to handle locking use cases most efficiently.&lt;br /&gt;
&lt;br /&gt;
6. Test plan&lt;br /&gt;
   Give a high level design of all tests and describe test cases to&lt;br /&gt;
   cover/verify all critical use cases and quality attribute scenarios&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
7. Plan review&lt;br /&gt;
   Review the estimates in the TSP cycle plan based on the information&lt;br /&gt;
   obtained during the work on the HLD.  Bring wrong estimates to the&lt;br /&gt;
   attention of the planning manager in the weekly meeting.&lt;br /&gt;
&lt;br /&gt;
8. EXIT CRITERIA&lt;br /&gt;
   A well formed design with prototypes, state machines, use cases,&lt;br /&gt;
   logic.&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Architecture_-_Changelogs_1.6&amp;diff=9797</id>
		<title>Architecture - Changelogs 1.6</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Architecture_-_Changelogs_1.6&amp;diff=9797"/>
		<updated>2008-03-12T19:11:33Z</updated>

		<summary type="html">&lt;p&gt;Nathan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Replication Changelogs for Lustre 1.6 =&lt;br /&gt;
&lt;br /&gt;
Intended as a short-term feature to facilitate efficient replication of large Lustre 1.6 filesystems.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
1. Generate a list of modified files for synchronization via an external tool.&lt;br /&gt;
2. The list must be complete, but may be conservative.&lt;br /&gt;
3. The list must be persistent, and may be repeated on failover.&lt;br /&gt;
4. After the original filesystem quiesces, the replica must become an exact mirror.&lt;br /&gt;
&lt;br /&gt;
== List generation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
[https://bugzilla.lustre.org/show_bug.cgi?id=14169 bug 14169]&lt;br /&gt;
[[Category:Architecture|Changelogs]]&lt;br /&gt;
[[Category:Architecture|Replication]]&lt;br /&gt;
[[Category:QAS|Changelogs]]&lt;br /&gt;
[[Category:Team_Rabbit]]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Architecture_-_Fileset&amp;diff=9771</id>
		<title>Architecture - Fileset</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Architecture_-_Fileset&amp;diff=9771"/>
		<updated>2007-12-14T18:56:03Z</updated>

		<summary type="html">&lt;p&gt;Nathan: add multiple filesets&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Summary ==&lt;br /&gt;
&lt;br /&gt;
A user application (or Lustre internal features) may want to perform an action on a very large set of files.  Such actions might include migration to slower storage, purging of old files, or replication to a proxy server.  A &#039;&#039;&#039;fileset&#039;&#039;&#039; is an efficient representation of these file identifiers (fids).&lt;br /&gt;
&lt;br /&gt;
The definition of any particular fileset is left to an external agent; no search features will be included in Lustre itself (excluding least-recently-used tracking, which can probably only be done efficiently within Lustre).  Typically, searches for files with particular metadata characteristics will be done by a database that mirrors the Lustre file tree via a [http://arch.lustre.org/index.php?title=Server_Changelogs ChangeLog].  The files matching these criteria will then be added to a fileset via a Lustre fileset API.&lt;br /&gt;
&lt;br /&gt;
Filesets will generally come in two flavors: arbitrary collections of files, or a full file tree.  See Enumeration below.&lt;br /&gt;
&lt;br /&gt;
== Definitions ==&lt;br /&gt;
&lt;br /&gt;
; &#039;&#039;&#039;Fileset&#039;&#039;&#039; : an arbitrary subset of files from within a single filesystem&#039;s namespace.&lt;br /&gt;
; &#039;&#039;&#039;Consumer&#039;&#039;&#039; : an entity acting on the contents of a fileset&lt;br /&gt;
; &#039;&#039;&#039;Internal consumer&#039;&#039;&#039; : a Lustre internal feature using a fileset (e.g. fileset client mount, maybe replicator, migrator)&lt;br /&gt;
; &#039;&#039;&#039;External consumer&#039;&#039;&#039; : an entity external to Lustre using a fileset.  This may be limited to a user of a fileset client mount, in which case no access to any other representation of a fileset is needed; see Client Access below.&lt;br /&gt;
; &#039;&#039;&#039;type&#039;&#039;&#039; : fileset type, see Enumeration below&lt;br /&gt;
&lt;br /&gt;
== Qualities ==&lt;br /&gt;
&lt;br /&gt;
{|cellspacing=&amp;quot;0&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
!Description&lt;br /&gt;
!Quality&lt;br /&gt;
!Semantics&lt;br /&gt;
|-&lt;br /&gt;
|coherence||usability||file modifications are reflected in the fileset (e.g. unlink, rename)&lt;br /&gt;
|-&lt;br /&gt;
|permanence||usability, scalability||how long filesets persist, and when they are discarded&lt;br /&gt;
|-&lt;br /&gt;
|synchronization||usability||the list of files in the set may change.&lt;br /&gt;
|-&lt;br /&gt;
|physiology||scalability||internal representation must be used efficiently&lt;br /&gt;
|-&lt;br /&gt;
|hashing||scalability||actions on a fileset may need to be distributed across multiple servers for scalability&lt;br /&gt;
|-&lt;br /&gt;
|modification||usability||the contents of a fileset may be modified over time to add or remove items&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Use Cases ==&lt;br /&gt;
&lt;br /&gt;
{| border=1 cellspacing=0&lt;br /&gt;
|-&lt;br /&gt;
!id !! quality attribute !! summary&lt;br /&gt;
|-&lt;br /&gt;
|compliance || usability, scalability || delete all files modified in 2002&lt;br /&gt;
|-&lt;br /&gt;
|workset || availability || the files in the fileset are available on a remote proxy server &lt;br /&gt;
|-&lt;br /&gt;
|backup || scalability || filesystem must be subdivided into manageable chunks for backup / replication&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
;compliance&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || delete all files modified in 2002&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Provide an API to facilitate filesystem operations based on database search output&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Usability, scalability&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| The fileset and requested operation are fed to the API&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| Compliance policy dictates removal of old files&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| Database has recent FS information (from watching a [http://arch.lustre.org/index.php?title=Server_Changelogs ChangeLog])&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| Fileset, type 1&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| Lustre performs the requested operation on each of the files in the fileset&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| fileset is created, operation is completed on all elements of the fileset&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| Are all operations executed from userspace on a client (external), or some directly on Lustre via an API?&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| &lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
;workset&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || all files with the words &amp;quot;bunny rabbit&amp;quot; are replicated at a dozen remote analysis clusters&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Provide current access to dynamic set of files on a proxy server&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Availability&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| Search results are fed to the API&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| External search or project directory&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| Database has recent FS information (e.g. from watching a ChangeLog)&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| Fileset, type 1 or type 2&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| Lustre creates an internal representation of the fileset and makes it available for export.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| Fileset is created&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| Is a small time lag acceptable, or must proxies / filesets be absolutely synchronous&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| &lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
;backup&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; ||filesystem must be subdivided into manageable chunks for backup / replication&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || User requires particular backup policies on particular sets of files&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Feature, Scalability&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;|| External app reads all files in a fileset&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| External HSM or backup application&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| Client access to a limited, defined list of files&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| Fileset, type 2&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| All files in fileset are backed up&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| Backup time, minor filesystem load during backup&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;||Subdivision of migration work seems like it should be handled by migration architecture; doesn&#039;t seem to really have anything to do with filesets&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| &lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Requirements ==&lt;br /&gt;
=== Dynamic ===&lt;br /&gt;
Search results may be returned slowly, or new files that meet the search criteria may be added to the filesystem.  In those cases, it should be possible to add items to (or remove items from) an existing fileset.  The fileset should in turn notify its consumers of the change.  &lt;br /&gt;
Alternately, some filesets may be defined to be static.&lt;br /&gt;
&lt;br /&gt;
=== Persistence ===&lt;br /&gt;
The workset case implies a fileset must be persistent across server / client reboots.&lt;br /&gt;
&lt;br /&gt;
=== Specification ===&lt;br /&gt;
It may be desirable for a remote site to specify a fileset that should be locally proxied (i.e. pull instead of push).  A fileset &#039;&#039;name&#039;&#039; is probably useful for this.  (e.g. a client requests mirror fileset &#039;bunnyrabbit&#039; on local proxy servers)&lt;br /&gt;
&lt;br /&gt;
=== Coherence ===&lt;br /&gt;
Files referenced in the fileset must be coherent with the original file.  E.g. if a file referenced by a fileset is moved, the fileset should reflect the new file location.  If a file in a fileset is deleted, the file should disappear from the fileset.  Maybe this can be achieved by having the fileset take appropriate &#039;&#039;&#039;locks&#039;&#039;&#039; on the original files.&lt;br /&gt;
&lt;br /&gt;
Coherence requirements:&lt;br /&gt;
  - unlink&lt;br /&gt;
  - rename&lt;br /&gt;
  - move to a new directory&lt;br /&gt;
  - file metadata (access time, perms, owner, etc.)&lt;br /&gt;
&lt;br /&gt;
Note that if changing the above would cause a file to no longer meet the original search criteria that generated the fileset, it is up to the search generator to (eventually) remove it from the fileset.  There are two exceptions to this rule, where the file should be removed from the fileset automatically: &lt;br /&gt;
;1. unlink&lt;br /&gt;
;2. move of a file included by virtue of its location in a file tree to a location outside of that tree (see Enumeration below)&lt;br /&gt;
&lt;br /&gt;
=== Fileset as Object === &lt;br /&gt;
Depending on the intended use, some filesets may be represented more efficiently than others, or may require different descriptors or methods.  Implementing filesets as objects with variable attributes and methods may provide broad but efficient coverage of the range of uses.  For example, one common type of fileset may be &amp;quot;a user&#039;s home directory&amp;quot;, which could be efficiently represented as a single directory fid.&lt;br /&gt;
&lt;br /&gt;
=== Hashing ===&lt;br /&gt;
When performing an action on large filesets or large numbers of filesets, we must be able to distribute the load across multiple servers to ensure performant operation.  This is true for internal consumers; for external consumers, this function should perhaps be offloaded to a distributed application.&lt;br /&gt;
&lt;br /&gt;
For example, 10,000 filesets are to be replicated independently.  A changelog per fileset may not scale well, and instead we may need a scalable algorithm to find the results for each fileset from a global changelog.&lt;br /&gt;
&lt;br /&gt;
=== ChangeLog ===&lt;br /&gt;
It may be useful to have a per-fileset [http://arch.lustre.org/index.php?title=Server_Changelogs changelog] maintained for audit or replication purposes.  A fileset-specific changelog could be used to provide migration/replication-related events specific to the fileset to migration agents.  The agents would then use this information e.g. to abort / commence copying a file.&lt;br /&gt;
&lt;br /&gt;
However, maintaining a per-fileset changelog may not scale.  At some point, it may make more sense to process a common global changelog.&lt;br /&gt;
&lt;br /&gt;
=== Multiple Membership ===&lt;br /&gt;
A file may be part of multiple filesets.  A type 2 fileset may implicitly include other type 2 filesets.  Operations on a file should affect all filesets it belongs to, and vice-versa. &lt;br /&gt;
&lt;br /&gt;
=== Fileset API ===&lt;br /&gt;
The user API for filesets should include the following functionality:&lt;br /&gt;
   - Start a new fileset &lt;br /&gt;
   - Add items to a fileset&lt;br /&gt;
   - Remove items from a fileset&lt;br /&gt;
   - Delete a fileset&lt;br /&gt;
   - Initiate activity of an internal consumer (e.g. migrate fileset bunny from poolA to poolB)&lt;br /&gt;
   - Provide client access to a fileset (see Client Access below)&lt;br /&gt;
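As a toy illustration of this API surface (in-memory Python only; real filesets would be persistent objects on the MDT, and all names here are hypothetical):&lt;br /&gt;

```python
class Fileset:
    """Toy fileset: a named, mutable collection of fids."""
    def __init__(self, name):
        self.name = name
        self.fids = set()

    def add(self, *fids):
        self.fids.update(fids)

    def remove(self, fid):
        self.fids.discard(fid)

# registry of live filesets, keyed by name
registry = {}

def fileset_new(name):
    registry[name] = Fileset(name)
    return registry[name]

def fileset_destroy(name):
    del registry[name]
```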
&lt;br /&gt;
== Implementation Notes ==&lt;br /&gt;
&lt;br /&gt;
=== Enumeration ===&lt;br /&gt;
Fileset enumeration should be handled in two ways:&lt;br /&gt;
;Type 1. An explicit enumeration of files or directories.  Files within directories are &#039;&#039;not&#039;&#039; included in the fileset unless explicitly listed as well.&lt;br /&gt;
;Type 2. Inclusive file trees.  &#039;&#039;All&#039;&#039; files / subdirectories below enumerated directories are included in the fileset.&lt;br /&gt;
&lt;br /&gt;
We should have provision for using both types of filesets.  In fact, with some per-entry flags, we can define &amp;quot;mixed&amp;quot; filesets including both of the above (each entry in a fileset may be type 1 (flat = single file) or type 2 (tree)).  Perhaps a third type, &amp;quot;not_included&amp;quot;, would also be useful, to specifically exclude a particular subdirectory from a type 2 fileset.&lt;br /&gt;
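A membership test over such a mixed fileset might look like the following sketch, where the most specific (longest) matching entry wins; the flag names are assumptions:&lt;br /&gt;

```python
def in_fileset(path, entries):
    """Membership test for a mixed fileset.

    entries: list of (path, flag) pairs, where flag is 'flat'
    (type 1, single file), 'tree' (type 2, whole subtree), or
    'not_included' (exclude a subtree).  The longest matching
    entry decides the result.
    """
    best = None
    # process entries shortest-first, so longer (more specific)
    # matches overwrite earlier decisions
    for epath, flag in sorted(entries, key=lambda e: len(e[0])):
        if flag == "flat":
            if path == epath:
                best = True
        elif path == epath or path.startswith(epath.rstrip("/") + "/"):
            best = (flag == "tree")
    return bool(best)
```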
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Storage ===&lt;br /&gt;
Permanent fileset definitions would probably be stored on the MDT (as opposed to the MGS) for scalability and namespace-related locking.   &lt;br /&gt;
&lt;br /&gt;
=== UI ===&lt;br /&gt;
==== Maintenance ====&lt;br /&gt;
The UI for maintaining filesets might reasonably be run through lfs similar to &lt;br /&gt;
[http://arch.lustre.org/index.php?title=Pools_of_targets pools]:&lt;br /&gt;
# lfs fileset_new &amp;lt;fileset name&amp;gt;  Define a new fileset&lt;br /&gt;
# lfs fileset_add &amp;lt;fileset name&amp;gt; &amp;lt;options&amp;gt; &amp;lt;filename1&amp;gt; &amp;lt;filename2&amp;gt; ...  Add the named files to the fileset; define type 1 or type 2&lt;br /&gt;
# lfs fileset_remove &amp;lt;fileset name&amp;gt; &amp;lt;filename&amp;gt;  Remove the named file from the fileset&lt;br /&gt;
# lfs fileset_destroy &amp;lt;fileset name&amp;gt; Remove the definition of the fileset&lt;br /&gt;
&lt;br /&gt;
==== Client Access ====&lt;br /&gt;
For arbitrary user access to the files in a fileset, a mechanism like &#039;&#039;&#039;mount(8)&#039;&#039;&#039; seems like it would provide a clear, simple way to retrieve a fileset.  (Command format might be &amp;quot;mount -t lustre mgs://fsname/fileset mntpt&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
For type 1 filesets, a hierarchical namespace defined by the files and directories in the fileset would be constructed locally.   Directories would all be &#039;&#039;read/execute-only&#039;&#039;; a client cannot add new entries into the fileset by creating files in the fileset hierarchy.  Regular files would keep their normal access permissions.&lt;br /&gt;
&lt;br /&gt;
For type 2 filesets, the mount point would act exactly like a subtree of the full lustre fs.&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
; [https://bugzilla.lustre.org/show_bug.cgi?id=14168 bug 14168]&lt;br /&gt;
; [http://arch.lustre.org/index.php?title=Server_Changelogs server changelogs]&lt;br /&gt;
[[Category:Architecture|Fileset]]&lt;br /&gt;
[[Category:Team_Rabbit]]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Architecture_-_Adaptive_Timeouts_-_Use_Cases&amp;diff=9980</id>
		<title>Architecture - Adaptive Timeouts - Use Cases</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Architecture_-_Adaptive_Timeouts_-_Use_Cases&amp;diff=9980"/>
		<updated>2007-08-27T22:53:01Z</updated>

		<summary type="html">&lt;p&gt;Nathan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Adaptive Timeouts =&lt;br /&gt;
== Terminology ==&lt;br /&gt;
&lt;br /&gt;
{|border=1 cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;Adaptive Timeout (AT)&#039;&#039;&#039;||Network RPC timeouts based on server and network loading.&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;Early reply&#039;&#039;&#039;||A response sent immediately by the server informing the client of a predicted slower-than-expected final response.&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;Service estimate&#039;&#039;&#039;||The expected worst-case time a request on a given portal to a given service will take.  This value changes depending on server loading.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Architecture ==&lt;br /&gt;
&lt;br /&gt;
{| border=1 cellspacing=0&lt;br /&gt;
|-&lt;br /&gt;
|A. Report service times :|| Replies are annotated with the RPC&#039;s service time and the service estimate.  The service estimate is updated upon every &#039;success&#039; reply. Clients use the measured round-trip time and the reported service time to determine the round-trip network latency. Clients use the service estimate and network latency to set the timeout for future RPCs.&lt;br /&gt;
|-&lt;br /&gt;
|B. Early replies :|| Servers compare the timeout encoded in the RPC with the current service estimate and send early replies to all queued RPCs that they expect to time out before being serviced, reporting the new service estimate.  Clients receiving early replies adjust the RPC&#039;s local timeout to reflect the new service estimate.  This process is repeated whenever an RPC nears its deadline.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
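&lt;br /&gt;
The interaction of rows A and B above can be sketched as follows.  This is a minimal illustration only, not the actual ptlrpc implementation (which is C and tracks estimates per portal); the class and field names are invented for the sketch.&lt;br /&gt;

```python
# Minimal sketch (not the real ptlrpc code) of adaptive timeout tracking:
# the client combines the server-reported worst-case service estimate with
# its own measured network latency to set the timeout for future RPCs.
class AdaptiveTimeout:
    def __init__(self, initial_estimate=10.0):
        self.service_estimate = initial_estimate  # worst-case service time, seconds
        self.net_latency = 0.0                    # measured round-trip latency, seconds

    def on_reply(self, round_trip, service_time, reported_estimate):
        # A normal reply carries the RPC's actual service time and the
        # server's current worst-case estimate for this portal.
        self.net_latency = max(self.net_latency, round_trip - service_time)
        self.service_estimate = max(self.service_estimate, reported_estimate)

    def on_early_reply(self, reported_estimate):
        # An early reply only reports a new service estimate; the client
        # then extends the pending RPC's local deadline accordingly.
        self.service_estimate = max(self.service_estimate, reported_estimate)

    def next_timeout(self):
        return self.service_estimate + self.net_latency

at = AdaptiveTimeout()
at.on_reply(round_trip=3.0, service_time=2.5, reported_estimate=12.0)
print(at.next_timeout())  # 12.5
```

Keeping a running maximum (rather than an average) matches the &#039;&#039;worst-case&#039;&#039; wording of the service estimate above; a real implementation would also let stale maxima decay over time.&lt;br /&gt;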
&lt;br /&gt;
== Use cases ==&lt;br /&gt;
&lt;br /&gt;
=== Summary ===&lt;br /&gt;
&lt;br /&gt;
{| border=1 cellspacing=0&lt;br /&gt;
|-&lt;br /&gt;
!id !! quality attribute !! summary&lt;br /&gt;
|-&lt;br /&gt;
|congested_server_new || recovery || server congestion causes client timeouts to adapt, not fail (new rpc)&lt;br /&gt;
|-&lt;br /&gt;
|congested_server_pending || recovery || server congestion causes client timeouts to adapt, not fail (pending or in-progress rpcs)&lt;br /&gt;
|-&lt;br /&gt;
|timeouts_recover || performance || timeouts decrease when load decreases&lt;br /&gt;
|-&lt;br /&gt;
|new_clients_learn || performance || new clients learn about server/network timing immediately &lt;br /&gt;
|-&lt;br /&gt;
|lock_timeouts || recovery || lock timeouts are based only on lock processing history, per target&lt;br /&gt;
|-&lt;br /&gt;
|busy_client_not_evicted || recovery || a responding client is not evicted for failing to return a lock quickly&lt;br /&gt;
|-&lt;br /&gt;
|server_down_client_start || performance || client starts while server is down.&lt;br /&gt;
|-&lt;br /&gt;
|liblustre_client_joins_late || recovery || a liblustre client computes for 20 min, then discovers the server has rebooted.&lt;br /&gt;
|-&lt;br /&gt;
|client_collection_timeout || recovery || Heavily loaded server fails over; clients have long AT already, and so don&#039;t try to reconnect for a long time.&lt;br /&gt;
|-&lt;br /&gt;
|replay_timeout || recovery || Clients replaying lost reqs after server failover must wait for the server&#039;s recovery client collection phase to complete before they will see responses.&lt;br /&gt;
|-&lt;br /&gt;
|communications_failure || availability, performance || Lustre handling of communications failures&lt;br /&gt;
|-&lt;br /&gt;
|redundant_router_failure || availability, performance || Lustre handling of redundant router failure&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== server_down_client_start ===&lt;br /&gt;
&lt;br /&gt;
{|border=1  cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || New client starts or restarts while a server is down or unresponsive&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; ||Maximize performance &lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;||Performance&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;||New client tries to connect to an unresponsive server&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;||client (re)boot&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;||server or network failures&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;||Client connect time&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;||After the obd_connect RPC request timeout, a new connect is attempted, on either the same network or a different network.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;||Time to successfully connect once the server/network becomes available. &lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;||None&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;||We want obd_connect attempts on different connections (networks) to happen quickly, so that we try alternate routes in case one network fails.  But we want attempts on the same network to happen at LND-timeout speed (e.g. 50s), so we don&#039;t just pile up PUTs that are stuck.  The problem is that only a single RPC timeout can be specified, and at that point we don&#039;t know which connection the next attempt will use (decided in import_select_connection).  Perhaps we need to do this: if we are at the end of the imp_conn_list, set the connect RPC timeout slow (i.e. max(50s, adaptive timeout)); otherwise set it fast (adaptive timeout).&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
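&lt;br /&gt;
The timeout policy proposed in the Issues row above could be sketched as below.  This is illustrative only: the 50s LND timeout and the function shape are taken from the example in the text, not from actual code.&lt;br /&gt;

```python
# Sketch of the proposed connect-timeout policy (names illustrative):
# attempts cycling through alternate networks time out quickly, but the
# last connection in the list waits at LND-timeout speed so stuck PUTs
# are not piled up.
LND_TIMEOUT = 50.0  # seconds; assumed example value from the text

def connect_timeout(is_last_connection, adaptive_timeout):
    if is_last_connection:
        # Slow: wait out the LND timeout (or longer, if AT is larger).
        return max(LND_TIMEOUT, adaptive_timeout)
    # Fast: fail over to the next connection at adaptive-timeout speed.
    return adaptive_timeout

print(connect_timeout(True, 10.0))   # 50.0
print(connect_timeout(False, 10.0))  # 10.0
```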
&lt;br /&gt;
&lt;br /&gt;
=== liblustre_client_joins_late ===&lt;br /&gt;
&lt;br /&gt;
{|border=1 cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || A liblustre client computes for 20 min, then discovers the server has rebooted.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; ||Minimize evictions&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;||Recovery&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;||Liblustre does not use the pinger to verify server availability; it attempts reconnection only when the application tries to write.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;||Actively computing client&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;||server failures&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;||recoverable state in the cluster&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;||eviction of client determined by version recovery&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;||Availability&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;||None&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;||Version recovery makes it possible to rejoin after recovery period in some conditions; unrelated to AT.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== client_collection_timeout ===&lt;br /&gt;
&lt;br /&gt;
{|border=1 cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || Heavily loaded server fails over; clients have long AT already, and so don&#039;t try to reconnect for a long time.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; ||Minimize evictions&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;||Recovery&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
| &#039;&#039;&#039;Stimulus:&#039;&#039;&#039;||Client with a long AT waits a long time to try to reconnect to a rebooting server. The server has no idea how long to wait for the (first) client during the recovery client collection phase.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;||Client reconnect attempt&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;||slow server / network conditions, then server failover&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;||recoverable state in the cluster&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;||eviction of client(s) if client collection timeout is too short&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;||Availability&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;||None&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;||Because a newly rebooted server has no idea of previous network/server load, a fixed timeout must be used when waiting for the first client to reconnect during the client collection phase.  Once the first client has reconnected, the server can keep track of the maximum expected AT as reported by the client in the connect RPC.  This information can then be used to adjust how much longer the server will wait for client collection to complete.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
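&lt;br /&gt;
The scheme in the Issues row above can be sketched as follows.  This is illustrative only, not the Lustre recovery code; the fixed 50s value is an assumed placeholder.&lt;br /&gt;

```python
# Illustrative sketch: a freshly rebooted server has no load history, so it
# starts with a fixed client-collection window and stretches it to the
# largest adaptive timeout reported by clients in their connect RPCs.
FIXED_COLLECTION_TIMEOUT = 50.0  # seconds; assumed placeholder value

def collection_window(reported_ats):
    """reported_ats: AT values reported by clients that have reconnected."""
    if not reported_ats:
        # No client has reconnected yet: only the fixed timeout applies.
        return FIXED_COLLECTION_TIMEOUT
    return max(FIXED_COLLECTION_TIMEOUT, max(reported_ats))

print(collection_window([]))             # 50.0
print(collection_window([30.0, 120.0]))  # 120.0
```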
&lt;br /&gt;
&lt;br /&gt;
=== replay_timeout ===&lt;br /&gt;
&lt;br /&gt;
{|border=1 cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || Clients replaying lost reqs after server failover must wait for the server&#039;s recovery client collection phase to complete before they will see responses.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; ||Minimize recovery time&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;||Performance, Recovery&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus:&#039;&#039;&#039;||Client replay request&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;||Server failover&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;||server failure&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;||replay request timeout&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;||When a client tries to reconnect after failover, the replay req timeout is set to the AT expected processing time plus the fixed client collection timeout.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;||Performance&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;||Is it possible for version recovery to start recovery of certain files before the client collection period has finished?&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;||Recovery could run as c1, recovery period, c2, recovery period, c3, in which case c1 would time out and resend the replay request until either c3 joins or recovery is aborted because c3 never showed up.&lt;br /&gt;
It would of course be desirable for c1 and c2 to be able to start recovery without waiting a long time for c3 to arrive.&lt;br /&gt;
&lt;br /&gt;
Version-based recovery may help in this case: c1 and c2 know that the files they are operating on have the same pre-op version, so it is safe for them to begin recovery without waiting for c3.  Then either c1 and c2 will have completed recovery and wait one recovery period before normal recovery is closed, or c3 will join in time if there are dependencies.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== communications_failure ===&lt;br /&gt;
&lt;br /&gt;
{|border=1 cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || Lustre handling of communications failures&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Fault isolation&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Availability, performance&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus:&#039;&#039;&#039;||  Peer node crash / hang / reboot or network failure&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| Hardware/software failure.  Sysadmin&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| Operation under all loads&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| ptlrpc (both client and server), OST/OSC, MDC/MDS&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| Failed RPCs are completed cleanly.  The number of uncompleted RPCs to any peer is bounded.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| No effect on communications with other peers.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| None&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;||&lt;br /&gt;
&lt;br /&gt;
An RPC, whether on the client or the server, is a compound communication (e.g. request, bulk, reply).  Each individual communication has an associated buffer which may be active (posted with LNetPut() or LNetGet()) or passive (posted with LNetMEAttach()).  An RPC is only complete when the MDs for all its successfully posted communications have been unlinked.  LNET may access these buffers at any time until the &amp;quot;unlink&amp;quot; event has been received.  RPCs must be completed in all circumstances, even when an RPC is to be abandoned because of a timeout or a failed component communication.&lt;br /&gt;
&lt;br /&gt;
Uncompleted RPCs consume resources in LNET, the relevant LND, and its underlying network stack.  Further RPCs may fail if the number of uncompleted RPCs grows without limit: for example, if all network resources are tied up waiting for RPCs to a failed peer, new RPCs to working, responsive peers may fail.&lt;br /&gt;
&lt;br /&gt;
Calling LNetMDUnlink() on a posted MD ensures delivery of the &amp;quot;unlink&amp;quot; event at the earliest opportunity.  The time to deliver this event is guaranteed finite, but may be determined by the underlying network stack.  Note the fundamental race with normal completion: LNET handles this so that LNetMDUnlink() is safe to call at any time, and indeed any number of times; however, it returns success only on the first call, and only if the MD has not already auto-unlinked.&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== redundant_router_failure ===&lt;br /&gt;
&lt;br /&gt;
{|border=1 cellspacing=&amp;quot;0&amp;quot;&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Scenario:&#039;&#039;&#039; || Lustre handling of redundant LNET router failure&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Business Goals:&#039;&#039;&#039; || Transparent handling of the failure&lt;br /&gt;
|-align=&amp;quot;left&amp;quot; &lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Relevant QA&#039;s:&#039;&#039;&#039;|| Availability, performance&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|rowspan=&amp;quot;6&amp;quot; writing-mode=&amp;quot;vertical&amp;quot;|&#039;&#039;&#039;details&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus:&#039;&#039;&#039;||  Redundant router crashes or hangs&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Stimulus source:&#039;&#039;&#039;|| hardware/software failure or sysadmin reboot&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Environment:&#039;&#039;&#039;|| operation under all loads with LNET router failure detection enabled&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Artifact:&#039;&#039;&#039;|| ptlrpc (both client and server), OST/OSC, MDC/MDS&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response:&#039;&#039;&#039;|| Router fails transparently to applications using the file system.  Minimum performance impact.&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Response measure:&#039;&#039;&#039;|| No errors returned to the application.  Performance impact no worse than server failover. &lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Questions:&#039;&#039;&#039;|| None&lt;br /&gt;
|-align=&amp;quot;left&amp;quot;&lt;br /&gt;
|colspan=2|&#039;&#039;&#039;Issues:&#039;&#039;&#039;|| &lt;br /&gt;
&lt;br /&gt;
When a router fails, many communications, either buffered in the router or committed to be routed via the router, will fail.  This fails all the relevant RPCs; further intermittent RPC failures are possible until all nodes on all paths from client to server and back have detected and avoided the failed router.&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
The parent tracking bug for Adaptive Timeouts is [https://bugzilla.lustre.org/show_bug.cgi?id=3055 bug 3055].&lt;br /&gt;
 &lt;br /&gt;
[[Category:QAS|Recovery Failures]]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Help:Editing&amp;diff=2056</id>
		<title>Help:Editing</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Help:Editing&amp;diff=2056"/>
		<updated>2007-05-11T22:24:59Z</updated>

		<summary type="html">&lt;p&gt;Nathan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;see Mediawiki [http://meta.wikimedia.org/wiki/Help:Editing editing help]&lt;br /&gt;
and [http://meta.wikimedia.org/wiki/Help:Wikitext_examples examples]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Help:Editing&amp;diff=2055</id>
		<title>Help:Editing</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Help:Editing&amp;diff=2055"/>
		<updated>2007-05-11T22:22:53Z</updated>

		<summary type="html">&lt;p&gt;Nathan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;see [http://meta.wikimedia.org/wiki/Help:Editing Mediawiki]&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Patchless_Client&amp;diff=1780</id>
		<title>Patchless Client</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Patchless_Client&amp;diff=1780"/>
		<updated>2007-05-08T14:45:29Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* Versions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Patchless Client ==&lt;br /&gt;
As of Lustre 1.6.0, Lustre supports running the client modules on some unpatched &amp;quot;stock&amp;quot; kernels.&lt;br /&gt;
This results in some small performance losses, but may be worthwhile to some users for maintenance or contract reasons.  &lt;br /&gt;
&lt;br /&gt;
We will typically post a &amp;quot;patchless&amp;quot; RPM at the [http://www.lustre.org/downloads.html download site]. Alternatively, if building from source, the Lustre configure script will automatically detect the unpatched kernel and disable building the servers.&lt;br /&gt;
{{{&lt;br /&gt;
[lustre]$ ./configure --with-linux=/unpatched/kernel/source&lt;br /&gt;
}}}&lt;br /&gt;
&lt;br /&gt;
=== Versions ===&lt;br /&gt;
Currently, the patchless client works with these kernel versions:&lt;br /&gt;
&lt;br /&gt;
Vanilla kernel:&lt;br /&gt;
 * 2.6.15    (1.6.0)&lt;br /&gt;
 * 2.6.16    (1.6.0)&lt;br /&gt;
 * 2.6.17    (1.6.0) Mandriva&#039;s 2.6.17 is also reported working.&lt;br /&gt;
 * 2.6.18    (1.6.0)&lt;br /&gt;
 * 2.6.19    (1.6.0)&lt;br /&gt;
 * 2.6.20    (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
 * 2.6.21    (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
&lt;br /&gt;
Red Hat Enterprise Linux:&lt;br /&gt;
 * RHEL4 [2.6.9-42.0.8EL] (1.6.0) with the following caveats:&lt;br /&gt;
  - Nested Symlinks: due to improper lookup_continue logic with unpatched 2.6.15&lt;br /&gt;
    kernels and earlier, nested symlinks will lead to unpredictable results&lt;br /&gt;
  - FMODE_EXEC missing: Lustre will incorrectly allow a user from one client to&lt;br /&gt;
    write/truncate a binary simultaneously while a user from a different client&lt;br /&gt;
    executes the same binary &lt;br /&gt;
 * RHEL4U5 [2.6.9-55EL]   (1.6.0) Red Hat has included a Lustre-specific patch&lt;br /&gt;
   with RHEL4U5 which resolves the above issues.&lt;br /&gt;
&lt;br /&gt;
 * RHEL5     (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
&lt;br /&gt;
Fedora Core:&lt;br /&gt;
 * FC6       (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
&lt;br /&gt;
Suse:&lt;br /&gt;
 * SLES 10   (tbd)&lt;br /&gt;
&lt;br /&gt;
=== Known Issues ===&lt;br /&gt;
&lt;br /&gt;
Many NFS-related bugs are also addressed by the patchless client fixes.&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
	<entry>
		<id>http://wiki.old.lustre.org/index.php?title=Patchless_Client&amp;diff=1770</id>
		<title>Patchless Client</title>
		<link rel="alternate" type="text/html" href="http://wiki.old.lustre.org/index.php?title=Patchless_Client&amp;diff=1770"/>
		<updated>2007-05-07T17:08:52Z</updated>

		<summary type="html">&lt;p&gt;Nathan: /* Versions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Patchless Client ==&lt;br /&gt;
As of Lustre 1.6.0, Lustre supports running the client modules on some unpatched &amp;quot;stock&amp;quot; kernels.&lt;br /&gt;
This results in some small performance losses, but may be worthwhile to some users for maintenance or contract reasons.  &lt;br /&gt;
&lt;br /&gt;
We will typically post a &amp;quot;patchless&amp;quot; RPM at the [http://www.lustre.org/downloads.html download site]. Alternatively, if building from source, the Lustre configure script will automatically detect the unpatched kernel and disable building the servers.&lt;br /&gt;
{{{&lt;br /&gt;
[lustre]$ ./configure --with-linux=/unpatched/kernel/source&lt;br /&gt;
}}}&lt;br /&gt;
&lt;br /&gt;
=== Versions ===&lt;br /&gt;
Currently, the patchless client works with these kernel versions:&lt;br /&gt;
&lt;br /&gt;
Vanilla kernel:&lt;br /&gt;
 * 2.6.15    (1.6.0)&lt;br /&gt;
 * 2.6.16    (1.6.0)&lt;br /&gt;
 * 2.6.17    (1.6.0) Mandriva&#039;s 2.6.17 is also reported working.&lt;br /&gt;
 * 2.6.18    (1.6.0)&lt;br /&gt;
 * 2.6.19    (1.6.0)&lt;br /&gt;
 * 2.6.20    (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
 * 2.6.21    (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
&lt;br /&gt;
Red Hat Enterprise Linux:&lt;br /&gt;
 * RHEL4 [2.6.9-42.0.8EL]   (1.6.0) with the following caveats:&lt;br /&gt;
  - Nested Symlinks: due to improper lookup_continue logic with unpatched 2.6.15 kernels and earlier, nested symlinks will lead to unpredictable results&lt;br /&gt;
  - FMODE_EXEC missing: Lustre will incorrectly allow a user from one client to write/truncate a binary simultaneously while a user from a different client executes the same binary &lt;br /&gt;
 * RHEL4U5 (1.6.0) Red Hat has included a Lustre-specific patch with RHEL4U5 which resolves the above issues.&lt;br /&gt;
&lt;br /&gt;
 * RHEL5 (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
&lt;br /&gt;
Fedora Core:&lt;br /&gt;
 * FC6 (1.6.1 [https://bugzilla.lustre.org/show_bug.cgi?id=11647 bug 11647])&lt;br /&gt;
&lt;br /&gt;
Suse:&lt;br /&gt;
 * SLES 10 (tbd)&lt;br /&gt;
&lt;br /&gt;
=== Known Issues ===&lt;br /&gt;
&lt;br /&gt;
Many NFS-related bugs are also addressed by the patchless client fixes.&lt;/div&gt;</summary>
		<author><name>Nathan</name></author>
	</entry>
</feed>