Subsystem Map
libcfs | |
Summary | Libcfs provides an API comprising fundamental primitives and subsystems - e.g. process management and debugging support - that is used throughout LNET, Lustre, and associated utilities. This API defines a portable runtime environment that is implemented consistently on all supported build targets. |
Code |
lustre/lnet/libcfs/**/*.[ch] |
lnet | |
Summary | LNET = the Lustre Networking subsystem.
See the Lustre Networking white paper for details. |
Code |
lustre/lnet/**/*.[ch] |
ptlrpc | |
Summary | Ptlrpc implements Lustre communications over LNET.
All communication between Lustre processes is handled by RPCs, in which a request is sent to an advertised service, and the service processes the request and returns a reply. Note that a service may be offered by any Lustre process - e.g. the OST service on an OSS processes I/O requests and the AST service on a client processes notifications of lock conflicts. The initial request message of an RPC is special - it is received into the first available request buffer at the destination. All other communications involved in an RPC are like RDMAs - the peer targets them specifically. For example, in a bulk read, the OSC posts reply and bulk buffers and sends descriptors for them (the LNET matchbits used to post them) in the RPC request. After the server has received the request, it GETs or PUTs the bulk data and PUTs the RPC reply directly. Ptlrpc ensures all resources involved in an RPC are freed in finite time. If the RPC does not complete within a timeout, all buffers associated with the RPC must be unlinked. These buffers are still accessible to the network until their completion events have been delivered. |
Code |
lustre/ptlrpc/*.[ch] lustre/ldlm/ldlm_lib.c |
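The bulk-read flow described above can be illustrated with a small, self-contained sketch. Everything here (struct rpc_request, the matchbits counter, the function names) is hypothetical and only models the idea that the client posts its reply and bulk buffers first and advertises their matchbits in the request, after which the server targets those buffers directly; it is not the real ptlrpc API.

/* Simplified model of the bulk-read flow described above.
 * All names here are illustrative; the real code lives in lustre/ptlrpc/. */
#include <stdint.h>
#include <stdio.h>

struct rpc_request {
    uint64_t reply_matchbits;   /* where the client will accept the reply */
    uint64_t bulk_matchbits;    /* where the client will accept bulk data */
    uint64_t object_id;
    uint64_t offset, length;    /* extent the client wants to read */
};

/* Client side: post reply/bulk buffers first, then send the request.
 * The request itself is the only message received into a "first
 * available" buffer on the server; everything else is targeted. */
static struct rpc_request client_build_read(uint64_t obj, uint64_t off,
                                            uint64_t len, uint64_t *next_mb)
{
    struct rpc_request req;
    req.reply_matchbits = (*next_mb)++;  /* buffer posted before sending */
    req.bulk_matchbits  = (*next_mb)++;  /* buffer posted before sending */
    req.object_id = obj;
    req.offset = off;
    req.length = len;
    return req;
}

/* Server side: having received the request, it targets the client's
 * buffers directly - conceptually a PUT of the bulk data and a PUT of
 * the reply, both addressed by the matchbits carried in the request. */
static void server_handle_read(const struct rpc_request *req)
{
    printf("PUT %llu bytes of object %llu to bulk matchbits %llu\n",
           (unsigned long long)req->length,
           (unsigned long long)req->object_id,
           (unsigned long long)req->bulk_matchbits);
    printf("PUT reply to reply matchbits %llu\n",
           (unsigned long long)req->reply_matchbits);
}

int main(void)
{
    uint64_t next_matchbits = 1;
    struct rpc_request req = client_build_read(42, 0, 1048576, &next_matchbits);
    server_handle_read(&req);
    return 0;
}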
llog | |
Summary |
Overview LLog is the generic logging mechanism in Lustre. It allows Lustre to store records in an appropriate format and access them later using a reasonable API. LLog is used in various cases. The main LLog use cases are the following:
General design Each llog type has two main parts:
|
Code |
obdclass/llog.c obdclass/llog_cat.c obdclass/llog_lvfs.c obdclass/llog_obd.c obdclass/llog_swab.c obdclass/llog_test.c lov/lov_log.c ptlrpc/llog_client.c ptlrpc/llog_server.c ptlrpc/llog_net.c |
obdclass | |
Summary | The obdclass code implements generic Lustre configuration and device handling. Different functional parts of the Lustre code are split into obd devices which can be configured and connected in various ways to form a server or client filesystem.
Several examples of obd devices include:
The obdclass code provides services used by all Lustre devices for configuration, memory allocation, generic hashing, kernel interface routines, random number generation, etc. |
Code |
lustre/obdclass/class_hash.c - scalable hash code for imports lustre/obdclass/class_obd.c - base device handling code lustre/obdclass/debug.c - helper routines for dumping data structs lustre/obdclass/genops.c - device allocation/configuration/connection lustre/obdclass/linux-module.c - linux kernel module handling lustre/obdclass/linux-obdo.c - pack/unpack obdo and other IO structs lustre/obdclass/linux-sysctl.c - /proc/sys configuration parameters lustre/obdclass/lprocfs_status.c - /proc/fs/lustre configuration/stats, helpers lustre/obdclass/lustre_handles.c - wire opaque pointer handlers lustre/obdclass/lustre_peer.c - peer target identification by UUID lustre/obdclass/obd_config.c - configuration file parsing lustre/obdclass/obd_mount.c - server filesystem mounting lustre/obdclass/obdo.c - more obdo handling helpers lustre/obdclass/statfs_pack.c - statfs helpers for wire pack/unpack lustre/obdclass/uuid.c - UUID pack/unpack lustre/lvfs/lvfs_common.c - kernel interface helpers lustre/lvfs/lvfs_darwin.c - darwin kernel helper routines lustre/lvfs/lvfs_internal.h - lvfs internal function prototypes lustre/lvfs/lvfs_lib.c - statistics lustre/lvfs/lvfs_linux.c - linux kernel helper routines lustre/lvfs/lvfs_userfs.c - userspace helper routines lustre/lvfs/prng.c - long period pseudo-random number generator lustre/lvfs/upcall_cache.c - supplementary group upcall for MDS |
luclass | |
Summary | luclass is a body of data-type definitions and functions implementing support for a layered object, that is, an entity where every layer in the Lustre device stack (both data and metadata, and both client and server side) can maintain its own private state and modify the behavior of the compound object in a systematic way.
Specifically, data-types are introduced representing a device type (struct lu_device_type, a layer in the Lustre stack), a device (struct lu_device, a specific instance of the type), and an object (struct lu_object). The following lu_object functionality is implemented by generic code:
In addition to objects and devices, luclass includes lu_context, which provides a way to efficiently allocate space without consuming stack space. The luclass design is specified in the MD API DLD. |
Code |
include/lu_object.h obdclass/lu_object.c |
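The layered-object idea can be sketched with a much-simplified model: each layer of the device stack owns one slice of a compound object, and generic code walks the slices so every layer can act on the object in turn. The real types are struct lu_device_type, struct lu_device and struct lu_object in include/lu_object.h; the names below (layer_object, layer_ops, compound_print) are illustrative only.

/* Much-simplified model of a layered ("compound") object: every layer in
 * the device stack contributes its own per-object slice.  Everything here
 * is illustrative; the real types live in lu_object.h. */
#include <stdio.h>

struct layer_object;                      /* one layer's slice of the object */

struct layer_ops {
    void (*print)(const struct layer_object *slice);
};

struct layer_object {
    const char             *layer_name;   /* which device/layer owns the slice */
    const struct layer_ops *ops;
    struct layer_object    *next;         /* next layer's slice */
};

/* Generic code walks the list of slices and lets every layer act in turn -
 * this is the "systematic" per-layer behaviour mentioned above. */
static void compound_print(struct layer_object *top)
{
    for (struct layer_object *o = top; o != NULL; o = o->next)
        o->ops->print(o);
}

static void generic_print(const struct layer_object *o)
{
    printf("layer %s: private state lives here\n", o->layer_name);
}

static const struct layer_ops ops = { .print = generic_print };

int main(void)
{
    struct layer_object osc = { "osc", &ops, NULL };
    struct layer_object lov = { "lov", &ops, &osc };
    struct layer_object vvp = { "vvp", &ops, &lov };

    compound_print(&vvp);   /* each layer of the stack gets its turn */
    return 0;
}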
ldlm | |
Summary | The Lustre Distributed Lock Manager (LDLM) is the Lustre locking infrastructure; it handles locks between clients and servers as well as locks local to a node. Different kinds of locks are available with different properties. For historical reasons, ldlm also contains some of the generic connection and service code (both server and client). |
Code |
interval_tree.c - used by extent locks to maintain interval trees (bug 11300). l_lock.c - resource locking primitives. ldlm_extent.c - extent locking code, used for locking regions inside objects. ldlm_flock.c - BSD and POSIX file locking lock types. ldlm_inodebits.c - inodebits locks used for metadata locking. ldlm_lib.c - target and client connect/reconnect/recovery code. Does not really belong to ldlm, but is historically placed there; should be in ptlrpc instead. ldlm_lock.c - mostly functions dealing with struct ldlm_lock. ldlm_lockd.c - functions that handle incoming lock-related RPCs, both on the server (lock enqueue/cancel/...) and on the client (AST handling). ldlm_plain.c - plain locks, the predecessor to inodebits locks; not widely used now. ldlm_pool.c - pools of locks, related to dynamic LRUs and freeing locks on demand. ldlm_request.c - collection of functions for working with lock handles as opposed to the lock structures themselves. ldlm_resource.c - functions operating on namespaces and lock resources. include/lustre_dlm.h - important defines and declarations for ldlm. |
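As a minimal illustration of what extent locks decide, the sketch below checks whether two byte-range locks on the same resource conflict, assuming the usual semantics that overlapping ranges conflict unless both locks are read locks. The real implementation in ldlm_extent.c works with the full set of DLM lock modes and interval trees; the types and names here are hypothetical.

/* Minimal illustration of extent-lock conflict detection: two locks on the
 * same resource conflict when their byte ranges overlap and at least one
 * of them is a write lock. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum lock_mode { LOCK_READ, LOCK_WRITE };

struct extent_lock {
    uint64_t       start;
    uint64_t       end;     /* inclusive end of the locked extent */
    enum lock_mode mode;
};

static bool extents_overlap(const struct extent_lock *a,
                            const struct extent_lock *b)
{
    return a->start <= b->end && b->start <= a->end;
}

static bool locks_conflict(const struct extent_lock *a,
                           const struct extent_lock *b)
{
    if (!extents_overlap(a, b))
        return false;
    return a->mode == LOCK_WRITE || b->mode == LOCK_WRITE;
}

int main(void)
{
    struct extent_lock reader = { 0,      1048575, LOCK_READ  };
    struct extent_lock writer = { 524288, 2097151, LOCK_WRITE };

    printf("conflict: %s\n", locks_conflict(&reader, &writer) ? "yes" : "no");
    return 0;
}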
fids | |
Summary | A FID is the unique object identifier used throughout the cluster since Lustre 1.7. It has a few properties, the main ones being the following:
A FID consists of 3 fields:
|
Code |
fid/fid_request.c fid/fid_lib.c fld/*.[ch] |
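In essence a FID is a <sequence, object id, version> triple; the sketch below mirrors that layout and the conventional [seq:oid:ver] textual form. The struct name example_fid is illustrative - the real definition is struct lu_fid in the Lustre headers.

/* A FID is the <sequence, object id, version> triple; field widths here
 * follow the on-wire layout of struct lu_fid in the Lustre headers. */
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

struct example_fid {
    uint64_t f_seq;  /* sequence the FID belongs to; one sequence lives
                      * entirely on one MDS, so the FLD maps seq -> server */
    uint32_t f_oid;  /* object id within the sequence */
    uint32_t f_ver;  /* object version, reserved for future use */
};

int main(void)
{
    struct example_fid fid = { .f_seq = 0x200000400ULL, .f_oid = 17, .f_ver = 0 };

    /* The conventional textual form is [seq:oid:ver]. */
    printf("[0x%"PRIx64":0x%x:0x%x]\n", fid.f_seq, fid.f_oid, fid.f_ver);
    return 0;
}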
seq | |
Summary | Overview
Sequence management is a basic mechanism in the new MDS server related to managing FIDs. A FID is a unique object identifier in Lustre starting from version 1.7. All FIDs are organized into sequences; one sequence is a range of FIDs. Sequences are granted/allocated to clients by servers, and FIDs are allocated by clients inside the granted sequence. All FIDs inside one sequence live on the same MDS server and as such form one "migration unit" and one "indexing unit", meaning that the FLD (FID Location Database) indexes them all by that one sequence and thus has only one mapping entry for all FIDs in the sequence. Please see the section devoted to FIDs elsewhere in this table for more information on the FLD service and FIDs. A sequence has a limit on the number of FIDs that may be allocated in it; when this limit is reached, a new sequence is allocated. Upon disconnect, the server allocates a new sequence to the client when it comes back, and the previously used sequence is abandoned even if it was not exhausted. Sequences are a valuable resource, but in the case of recovery, using a new sequence makes things easier and also allows FIDs and objects to be grouped by working session: new connection, new sequence. Code description Server side code is divided into two parts:
Client side code allocates new sequences from the granted meta-sequence. When the meta-sequence is exhausted, a new one is allocated on the server and sent to the client. The client code consists of an API for working with both server-side parts, not only with the sequence manager: as all servers need to talk to the sequence controller, they also use the client API for this. One important part of the client API is FID allocation. A new FID is allocated in the currently granted sequence until the sequence is exhausted. |
Code |
fid/fid_handler.c - server side sequence management code; fid/fid_request.c - client side sequence management code; fid/fid_lib.c - fids related miscellaneous stuff. |
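A minimal sketch of the client-side allocation described above: FIDs are handed out from the currently granted sequence until it is exhausted, at which point a fresh sequence is obtained from the server and the old one is abandoned. The sequence width, the names, and the stand-in server_alloc_sequence() call are all illustrative assumptions, not the real fid/fid_request.c API.

/* Sketch of client-side FID allocation: FIDs come from the currently
 * granted sequence until its width is exhausted, then a new sequence is
 * requested from the server.  Names and the width are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define SEQ_WIDTH 128   /* illustrative: max FIDs allocated per sequence */

struct client_seq {
    uint64_t seq;       /* sequence currently granted by the server */
    uint32_t next_oid;  /* next object id to hand out in that sequence */
};

/* Stand-in for the RPC that asks the server for a fresh sequence. */
static uint64_t server_alloc_sequence(void)
{
    static uint64_t next_seq = 0x400;
    return next_seq++;
}

static void fid_alloc(struct client_seq *cs, uint64_t *seq, uint32_t *oid)
{
    if (cs->seq == 0 || cs->next_oid > SEQ_WIDTH) {
        cs->seq = server_alloc_sequence();   /* old sequence is abandoned */
        cs->next_oid = 1;
    }
    *seq = cs->seq;
    *oid = cs->next_oid++;
}

int main(void)
{
    struct client_seq cs = { 0, 0 };
    for (int i = 0; i < 3; i++) {
        uint64_t seq; uint32_t oid;
        fid_alloc(&cs, &seq, &oid);
        printf("FID [0x%llx:0x%x:0x0]\n", (unsigned long long)seq, oid);
    }
    return 0;
}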
mountconf | |
Summary | MountConf is how servers and clients are set up, started, and configured. A MountConf usage document is here.
The major subsystems are the MGS, MGC, and the userspace tools mount.lustre and mkfs.lustre. The basic idea is:
|
Code |
MountConf file areas: lustre/mgs/* lustre/mgc/* lustre/obdclass/obd_mount.c lustre/utils/mount_lustre.c lustre/utils/mkfs_lustre.c |
liblustre | |
Summary | Liblustre is a userspace library, used along with libsysio (developed by Sandia), that allows Lustre usage just by linking (or ld_preload'ing) applications with it. Liblustre does not require any kernel support. It is also used on old Cray XT3 machines (and not so old, in the case of Sandia), where all applications are just linked with the library and loaded into memory as the only code to run. Liblustre does not support async operations of any kind due to a lack of interrupts and other notifiers from lower levels to Lustre. Liblustre includes another set of LNDs that are able to work from userspace. |
Code |
dir.c - directory operations file.c - file handling operations (like open) llite_lib.c - general support (init/cleanup/parse options) lutil.c - supplementary code to get IP addresses and init various structures needed to emulate a normal Linux process from other layers' perspective. namei.c - metadata operations code. rw.c - I/O code, including read/write super.c - "superblock" operations - mounting/unmounting, inode operations tests/ - directory with liblustre-specific tests. |
echo client/server | |
Summary | The echo_client and obdecho are OBD devices which help testing and performance measurement.
They were implemented originally for network testing - obdecho can replace obdfilter and echo_client can exercise any downstream configurations. They are normally used in the following configurations:
|
Code |
lustre/obdecho/ |
client vfs | |
Summary | The client VFS interface, also called llite, is the bridge between the Linux kernel and the underlying Lustre infrastructure represented by the LOV, MDC, and LDLM subsystems. This includes mounting the client filesystem, handling name lookups, starting file I/O, and handling file permissions.
The Linux VFS interface shares a lot in common with the liblustre interface, which is used in the Catamount environment; as of yet, the code for these two subsystems is not common and contains a lot of duplication. |
Code |
lustre/llite/dcache.c - Interface with Linux dentry cache/intents lustre/llite/dir.c - readdir handling, filetype in dir, dir ioctl lustre/llite/file.c - File handles, file ioctl, DLM extent locks lustre/llite/llite_close.c - File close for opencache lustre/llite/llite_internal.h - Llite internal function prototypes, structures lustre/llite/llite_lib.c - Majority of request handling, client mount lustre/llite/llite_mmap.c - Memory-mapped I/O lustre/llite/llite_nfs.c - NFS export from clients lustre/llite/lloop.c - Loop-like block device export from object lustre/llite/lproc_llite.c - /proc interface for tunables, statistics lustre/llite/namei.c - Filename lookup, intent handling lustre/llite/rw24.c - Linux 2.4 IO handling routines lustre/llite/rw26.c - Linux 2.6 IO handling routines lustre/llite/rw.c - Linux generic IO handling routines lustre/llite/statahead.c - Directory statahead for "ls -l" and "rm -r" lustre/llite/super25.c - Linux 2.6 VFS file method registration lustre/llite/super.c - Linux 2.4 VFS file method registration lustre/llite/symlink.c - Symbolic links lustre/llite/xattr.c - User-extended attributes |
| colspan="2" valign="top |
libcfs
|-
| Summary
| Libcfs provides an API comprising fundamental primitives and subsystems - e.g. process management and debugging support which is used throughout LNET, Lustre, and associated utilities. This API defines a portable runtime environment that is implemented consistently on all supported build targets.
|-
| Code
|
lustre/lnet/libcfs/**/*.[ch]
|}
lnet | |
Summary | LNET = the Lustre Networking subsystem.
See the Lustre Networking white paper for details. |
Code |
lustre/lnet/**/*.[ch] |
ptlrpc | |
Summary | Ptlrpc implements Lustre communications over LNET.
All communication between Lustre processes are handled by RPCs, in which a request is sent to an advertised service, and the service processes the request and returns a reply. Note that a service may be offered by any Lustre process - e.g. the OST service on an OSS processes I/O requests and the AST service on a client processes notifications of lock conflicts. The initial request message of an RPC is special - it is received into the first available request buffer at the destination. All other communications involved in an RPC are like RDMAs - the peer targets them specifically. For example, in a bulk read, the OSC posts reply and bulk buffers and sends descriptors for them (the LNET matchbits used to post them) in the RPC request. After the server has received the request, it GETs or PUTs the bulk data and PUTs the RPC reply directly. Ptlrpc ensures all resources involved in an RPC are freed in finite time. If the RPC does not complete within a timeout, all buffers associated with the RPC must be unlinked. These buffers are still accessible to the network until their completion events have been delivered. |
Code |
lustre/ptlrpc/*.[ch] lustre/ldlm/ldlm_lib.c |
llog | |
Summary |
Overview LLog is the generic logging mechanism in Lustre. It allows Lustre to store records in an appropriate format and access them later using a reasonable API. LLog is used is various cases. The main LLog use cases are the following:
General design Each llog type has two main parts:
|
Code |
obdclass/llog.c obdclass/llog_cat.c obdclass/llog_lvfs.c obdclass/llog_obd.c obdclass/llog_swab.c obdclass/llog_test.c lov/lov_log.c ptlrpc/llog_client.c ptlrpc/llog_server.c ptlrpc/llog_net.c |
obdclass | |
Summary | The obdclass code is generic Lustre configuration and device handling. Different functional parts of the Lustre code are split into obd devices which can be configured and connected in various ways to form a server or client filesystem.
Several examples of obd devices include:
The obdclass code provides services used by all Lustre devices for configuration, memory allocation, generic hashing, kernel interface routines, random number generation, etc. |
Code |
lustre/obdclass/class_hash.c - scalable hash code for imports lustre/obdclass/class_obd.c - base device handling code lustre/obdclass/debug.c - helper routines for dumping data structs lustre/obdclass/genops.c - device allocation/configuration/connection lustre/obdclass/linux-module.c - linux kernel module handling lustre/obdclass/linux-obdo.c - pack/unpack obdo and other IO structs lustre/obdclass/linux-sysctl.c - /proc/sys configuration parameters lustre/obdclass/lprocfs_status.c - /proc/fs/lustre configuration/stats, helpers lustre/obdclass/lustre_handles.c - wire opaque pointer handlers lustre/obdclass/lustre_peer.c - peer target identification by UUID lustre/obdclass/obd_config.c - configuration file parsing lustre/obdclass/obd_mount.c - server filesystem mounting lustre/obdclass/obdo.c - more obdo handling helpers lustre/obdclass/statfs_pack.c - statfs helpers for wire pack/unpack lustre/obdclass/uuid.c - UUID pack/unpack lustre/lvfs/lvfs_common.c - kernel interface helpers lustre/lvfs/lvfs_darwin.c - darwin kernel helper routines lustre/lvfs/lvfs_internal.h - lvfs internal function prototypes lustre/lvfs/lvfs_lib.c - statistics lustre/lvfs/lvfs_linux.c - linux kernel helper routines lustre/lvfs/lvfs_userfs.c - userspace helper routines lustre/lvfs/prng.c - long period pseudo-random number generator lustre/lvfs/upcall_cache.c - supplementary group upcall for MDS |
luclass | |
Summary | luclass is a body of data-type definitions and functions implementing support for a layered object, that is an entity where every layer in the Lustre device stack (both data and meta-data, and both client and server side) can maintain its own private state, and modify a behavior of a compound object in a systematic way.
Specifically, data-types are introduced, representing a device type (struct lu_device_type, layer in the Lustre stack), a device (struct lu_device, a specific instance of the type), and object (struct lu_object). Following lu_object functionality is implemented by a generic code:
In addition to objects and devices, luclass includes lu_context, which is a way to efficiently allocate space, without consuming stack space. luclass design is specified in the MD API DLD. |
Code |
include/lu_object.h obdclass/lu_object.c |
ldlm | |
Summary | The Lustre Distributed Lock Manager (LDLM) is the Lustre locking infrastructure; it handles locks between clients and servers and locks local to a node. Different kinds of locks are available with different properties. Also as a historic heritage, ldlm happens to have some of the generic connection service code (both server and client). |
Code |
interval_tree.c - this is used by extent locks to maintain interval trees (bug 11300). l_lock.c - resourse locking primitives. ldlm_extent.c - extents locking code used for locking regions inside objects. ldlm_flock.c - bsd and posix locking lock types. ldlm_inodebits.c - inodebis locks used for metadata locking. ldlm_lib.c - target and client connecting/reconnecting/recovery code. Does not really belong to ldlm, but is historically placed there. Should be in ptlrpc instead. ldlm_lock.c - this source file mostly has functions dealing with struct. ldlm_lock ldlm_lockd.c - functions that imply replying to incoming lock-related rpcs (that could be both on server (lock enq/cancel/...) and client (ast handling)). ldlm_plain.c - plain locks, predecessor to inodebits locks; not widely used now. ldlm_pool.c - pools of locks, related to dynamic lrus and freeing locks on demand. ldlm_request.c - collection of functions to work with locks based handles as opposed to lock structures themselves. ldlm_resource.c - functions operating on namespaces and lock resources. include/lustre_dlm.h - important defines and declarations for ldlm. |
fids | |
Summary | FID is unique object identifier in cluster since 1.7. It has few properties, main of them are the following:
FID consists of 3 fields:
|
Code |
fid/fid_request.c fid/fid_lib.c fld/*.[ch] |
seq | |
Summary | Overview
Sequence management is a basic mechanism in new MDS server which is related to managing FIDs. FID is an unique object identifier in Lustre starting from version 1.7. All FIDs are organized into sequences. One sequence is number of FIDs. Sequences are granted/allocated to clients by servers. FIDs are allocated by clients inside granted sequence. All FIDs inside one sequence live on same MDS server and as such are one "migration unit" and one "indexing unit", meaning that FLD (FIDs Location Database) indexes them all using one sequence and thus has only one mapping entry for all FIDs in sequence. Please read section devoted to FIDs bellow in the root table to find more info on FLD service and FIDs. A sequence has the limit of FIDs to be allocated in it. When this limit is reached, new sequence is allocated. Upon disconnect, server allocates new sequence to the client when it comes back. Previously used sequence is abandoned even if it was not exhausted. Sequences are valuable resource but in the case of recovery, using new sequence makes things easier and also allows to group FIDs and objects by working sessions, new connection - new sequence. Code description Server side code is divided into two parts:
Client side code allocates new sequences from granted meta-sequence. When meta-sequence is exhausted, new one is allocated on server and sent to the client. Client code consists of API for working with both server side parts, not only with sequence manager as all servers need to talk to sequence controller, they also use client API for this. One important part of client API is FIDs allocation. New FID is allocated in currently granted sequence until sequence is exhausted. |
Code |
fid/fid_handler.c - server side sequence management code; fid/fid_request.c - client side sequence management code; fid/fid_lib.c - fids related miscellaneous stuff. |
mountconf | |
Summary | MountConf is how servers and clients are set up, started, and configured. A MountConf usage document is here.
The major subsystems are the MGS, MGC, and the userspace tools mount.lustre and mkfs.lustre. The basic idea is:
|
Code |
MountConf file areas: lustre/mgs/* lustre/mgc/* lustre/obdclass/obd_mount.c lustre/utils/mount_lustre.c lustre/utils/mkfs_lustre.c |
liblustre | |
Summary | Liblustre is a userspace library, used along with libsysio (developed by Sandia), that allows Lustre usage just by linking (or ld_preload'ing) applications with it. Liblustre does not require any kernel support. It is also used on old Cray XT3 machines (and not so old, in the case of Sandia), where all applications are just linked with the library and loaded into memory as the only code to run. Liblustre does not support async operations of any kind due to a lack of interrupts and other notifiers from lower levels to Lustre. Liblustre includes another set of LNDs that are able to work from userspace. |
Code |
dir.c - directory operations file.c - file handling operations (like open) llite_lib.c - general support (init/cleanp/parse options) lutil.c - supplementary code to get IP addresses and init various structures needed to emulate the normal Linux process from other layers' perspective. namei.c - metadata operations code. rw.c - I/O code, including read/write super.c - "superblock" operation - mounting/umounting, inode operations.tests - directory with liblustre-specific tests. |
echo client/server | |
Summary | The echo_client and obdecho are OBD devices which help testing and performance measurement.
They were implemented originally for network testing - obdecho can replace obdfilter and echo_client can excercise any downstream configurations. They are normally used in the following configurations...
|
Code |
lustre/obdecho/ |
client vfs | |
Summary | The client VFS interface, also called llite is the bridge between the Linux kernel and the underlying Lustre infrastructure represented by the LOV, MDC, and LDLM subsystems. This includes mounting the client filesystem, handling name lookups, starting file IO, and handling file permissions.
The Linux VFS interface shares a lot in common with the liblustre interface, which is used in the Catamount environment, but as yet the code for these two subsystems is not common and contains a lot of duplication. |
Code |
lustre/llite/dcache.c - Interface with Linux dentry cache/intents lustre/llite/dir.c - Readdir handling, filetype in dir, dir ioctl lustre/llite/file.c - File handles, file ioctl, DLM extent locks lustre/llite/llite_close.c - File close for opencache lustre/llite/llite_internal.h - Llite internal function prototypes, structures lustre/llite/llite_lib.c - Majority of request handling, client mount lustre/llite/llite_mmap.c - Memory-mapped IO lustre/llite/llite_nfs.c - NFS export from clients lustre/llite/lloop.c - Loop-like block device export from object lustre/llite/lproc_llite.c - /proc interface for tunables, statistics lustre/llite/namei.c - Filename lookup, intent handling lustre/llite/rw24.c - Linux 2.4 IO handling routines lustre/llite/rw26.c - Linux 2.6 IO handling routines lustre/llite/rw.c - Linux generic IO handling routines lustre/llite/statahead.c - Directory statahead for "ls -l" and "rm -r" lustre/llite/super25.c - Linux 2.6 VFS file method registration lustre/llite/super.c - Linux 2.4 VFS file method registration lustre/llite/symlink.c - Symbolic links lustre/llite/xattr.c - User extended attributes |
client vm | |
Summary | Client code interacts with VM/MM subsystems of the host OS kernel to cache data (in the form of pages), and to react to various memory-related events, like memory pressure.
Two key components of this interaction are:
|
Code |
This describes the next generation Lustre client I/O code, which is expected to appear in Lustre 2.0. The code location is not finalized. The cfs_page_t interface is defined and implemented in:
The generic part of cl_page will be located in:
The Linux kernel implementation is currently in:
|
client I/O | |
Summary | Client I/O is a group of interfaces used by various layers of a Lustre client to manage file data (as opposed to metadata). The main functions of these interfaces are:
The client I/O subsystem interacts with VFS, VM/MM, DLM, and PTLRPC. The client I/O interfaces are based on the following data-types:
|
Code |
This describes the next generation Lustre client I/O code. The code location is not finalized. The generic part is at:
Layer-specific methods are currently at
where LAYER is one of llite, lov, osc. |
client metadata | |
Summary | The Meta Data Client (MDC) is the client-side interface for all operations related to the Meta Data Server (MDS). In current configurations there is a single MDC on the client for each filesystem mounted on the client. The MDC is responsible for enqueueing metadata locks (via LDLM), and for packing and unpacking messages on the wire.
In order to ensure a recoverable system, the MDC is limited at the client to only a single filesystem-modifying operation in flight at one time. This includes operations like create, rename, link, unlink, and setattr. For non-modifying operations like getattr and statfs the client can have multiple RPC requests in flight at one time, limited by a tunable on the client, to avoid overwhelming the MDS. |
Code |
lustre/mdc/lproc_mdc.c - /proc interface for stats/tuning lustre/mdc/mdc_internal.h - Internal header for prototypes/structs lustre/mdc/mdc_lib.c - Packing of requests to MDS lustre/mdc/mdc_locks.c - Interface to LDLM and client VFS intents lustre/mdc/mdc_reint.c - Modifying requests to MDS lustre/mdc/mdc_request.c - Non-modifying requests to MDS |
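The flight-control rule above can be modelled with two counting semaphores: a single slot for filesystem-modifying requests and a larger, tunable number of slots for non-modifying ones. This is only a conceptual sketch using POSIX semaphores; MAX_RPCS_IN_FLIGHT and the function names are assumptions, and the real client uses its own in-flight accounting.

/* Conceptual model of MDC flight control: modifying metadata requests are
 * serialized (one in flight), non-modifying requests share a larger,
 * tunable limit.  POSIX semaphores are used purely for illustration. */
#include <semaphore.h>
#include <stdio.h>

#define MAX_RPCS_IN_FLIGHT 8   /* illustrative stand-in for the client tunable */

static sem_t modify_slot;      /* create/rename/link/unlink/setattr */
static sem_t readonly_slots;   /* getattr/statfs/... */

static void send_modifying_rpc(const char *what)
{
    sem_wait(&modify_slot);        /* at most one such RPC outstanding */
    printf("sending modifying RPC: %s\n", what);
    /* ... wait for reply ... */
    sem_post(&modify_slot);
}

static void send_readonly_rpc(const char *what)
{
    sem_wait(&readonly_slots);     /* bounded by the in-flight tunable */
    printf("sending non-modifying RPC: %s\n", what);
    /* ... wait for reply ... */
    sem_post(&readonly_slots);
}

int main(void)
{
    sem_init(&modify_slot, 0, 1);
    sem_init(&readonly_slots, 0, MAX_RPCS_IN_FLIGHT);

    send_modifying_rpc("mkdir");
    send_readonly_rpc("getattr");
    return 0;
}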
client lmv | |
Summary | LMV is a module which implements the CMD (clustered metadata) client-side abstraction device. It allows the client to work with many MDSes without any changes in the llite module, and even without llite knowing that CMD is supported. Llite just translates Linux VFS requests into metadata API calls and forwards them down the stack.
As LMV needs to know which MDS to talk to for any particular operation, it uses some new services introduced in CMD3:
LMV supports split objects. This means that for every split directory it creates a special in-memory structure which contains information about the object stripes, including the MDS number, FID, etc. All subsequent operations use these structures to determine which MDS should be used for a particular action (create, take a lock, etc.). |
Code |
lmv/*.[ch] |
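A sketch of the stripe-selection step described above: hash the file name and use the result to pick one of the directory's stripes, each of which records the MDS index and the FID of its stripe object. The hash function and the stripe_info structure are illustrative; the real hashing and layout live in lmv/.

/* Sketch of stripe selection in a split directory: the file name is hashed
 * and the hash picks one of the directory's stripes.  Illustrative only. */
#include <stdint.h>
#include <stdio.h>

struct stripe_info {
    int      mds_index;      /* which MDS holds this stripe */
    uint64_t stripe_fid_seq; /* FID (sequence part) of the stripe object */
};

/* Simple djb2-style string hash, purely for illustration. */
static uint32_t name_hash(const char *name)
{
    uint32_t h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}

static const struct stripe_info *
pick_stripe(const struct stripe_info *stripes, int count, const char *name)
{
    return &stripes[name_hash(name) % count];
}

int main(void)
{
    struct stripe_info dir_stripes[] = {
        { 0, 0x400 }, { 1, 0x401 }, { 2, 0x402 },
    };
    const struct stripe_info *s = pick_stripe(dir_stripes, 3, "datafile.0001");

    printf("operate on MDS %d (stripe seq 0x%llx)\n",
           s->mds_index, (unsigned long long)s->stripe_fid_seq);
    return 0;
}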
lov | |
Summary | The LOV device presents a single virtual device interface to upper layers (llite, liblustre, MDS). The LOV code is responsible for splitting requests across the correct OSTs based on the striping information (lsm), and for merging the replies into a single result to pass back to the higher layer.
It calculates per-object membership and offsets for read/write/truncate based on the virtual file offset passed from the upper layer. It is also responsible for splitting the locking across all servers as needed. The LOV on the MDS is also involved in object allocation. |
Code |
lustre/lov/lov_ea.c - Striping attributes pack/unpack/verify lustre/lov/lov_internal.h - Header for internal function prototypes/structs lustre/lov/lov_merge.c - Struct aggregation from many objects lustre/lov/lov_obd.c - Base LOV device configuration lustre/lov/lov_offset.c - File offset and object calculations lustre/lov/lov_pack.c - Pack/unpack of striping attributes lustre/lov/lov_qos.c - Object allocation for different OST loading lustre/lov/lov_request.c - Request handling/splitting/merging lustre/lov/lproc_lov.c - /proc/fs/lustre/lov tunables/statistics |
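The per-object membership and offset calculation is, in essence, RAID-0 style striping arithmetic. The sketch below maps a virtual file offset to the stripe (object) index and the offset within that object, assuming a fixed stripe size and round-robin placement; the structure and field names are illustrative, not the real lsm layout.

/* Round-robin (RAID-0 style) striping arithmetic: a virtual file offset is
 * mapped to the OST object holding that byte and the offset within it. */
#include <stdint.h>
#include <stdio.h>

struct stripe_map {
    uint64_t stripe_size;    /* e.g. 1 MiB */
    uint32_t stripe_count;   /* number of OST objects backing the file */
};

static void file_offset_to_object(const struct stripe_map *lsm, uint64_t off,
                                  uint32_t *stripe_index, uint64_t *obj_off)
{
    uint64_t chunk = off / lsm->stripe_size;        /* which stripe chunk */
    *stripe_index  = chunk % lsm->stripe_count;     /* which object */
    *obj_off       = (chunk / lsm->stripe_count) * lsm->stripe_size
                     + off % lsm->stripe_size;      /* offset in that object */
}

int main(void)
{
    struct stripe_map lsm = { .stripe_size = 1048576, .stripe_count = 4 };
    uint32_t idx; uint64_t ooff;

    file_offset_to_object(&lsm, 5 * 1048576 + 4096, &idx, &ooff);
    printf("file offset 5MiB+4KiB -> stripe %u, object offset %llu\n",
           idx, (unsigned long long)ooff);
    return 0;
}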
quota | |
Summary | Quotas allow a system administrator to limit the maximum amount of disk space a user or group can consume. Quotas are set by root, and can be specified for individual users and/or groups. Quota limits can be set on both blocks and inodes.
Lustre quota enforcement differs from standard Linux quota support in several ways:
|
Code |
Quota core:
Interactions with the underlying ldiskfs filesystem:
Hooks under:
Regression tests:
|
security-gss | |
Summary | Secure ptlrpc (sptlrpc) is a framework inside the ptlrpc layer. It acts on both sides of each ptlrpc connection between two nodes, transforming every RPC message and thereby turning the connection into a secure communication link. By using GSS, sptlrpc is able to support multiple authentication mechanisms, but currently only Kerberos 5 is supported.
Supported security flavors:
|
Code |
lustre/ptlrpc/sec*.c lustre/ptlrpc/gss/ lustre/utils/gss/ |
security-capa | |
Summary | Capabilities are pieces of data generated by one service (the master service), passed to a client, and presented by the client to another service (the slave service) to authorize an action. This mechanism is independent of the R/W/X permission-based file operation authorization. |
Code |
lustre/llite/llite_capa.c lustre/mdt/mdt_capa.c lustre/obdfilter/filter_capa.c lustre/obdclass/capa.c lustre/include/lustre_capa.h |
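A toy model of the capability mechanism described above: the master service binds a FID and an operation mask to an expiry time and seals them with a key it shares with the slave service, which later verifies what the client presents. The structure, the shared-key handling and especially the toy_mac() mixing function are illustrative stand-ins for the real keyed-hash code.

/* Toy model of a capability: issued by the master service (e.g. the MDS),
 * presented by the client, verified by the slave service (e.g. the OST). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct capability {
    uint64_t fid_seq;     /* object the capability covers */
    uint32_t fid_oid;
    uint32_t op_mask;     /* e.g. read/write bits */
    uint64_t expiry;      /* absolute expiry time, seconds */
    uint64_t mac;         /* seal computed by the master service */
};

/* Trivial mixing function standing in for a real keyed hash. */
static uint64_t toy_mac(const struct capability *c, uint64_t key)
{
    uint64_t m = key;
    m = m * 1000003 + c->fid_seq;
    m = m * 1000003 + c->fid_oid;
    m = m * 1000003 + c->op_mask;
    m = m * 1000003 + c->expiry;
    return m;
}

/* Master side: issue a capability to the client. */
static struct capability issue(uint64_t key, uint64_t seq, uint32_t oid,
                               uint32_t ops, uint64_t lifetime)
{
    struct capability c = { seq, oid, ops, (uint64_t)time(NULL) + lifetime, 0 };
    c.mac = toy_mac(&c, key);
    return c;
}

/* Slave side: verify what the client presents. */
static bool verify(const struct capability *c, uint64_t key, uint32_t op)
{
    return c->mac == toy_mac(c, key) &&
           (c->op_mask & op) == op &&
           (uint64_t)time(NULL) <= c->expiry;
}

int main(void)
{
    uint64_t shared_key = 0xdeadbeef;
    struct capability c = issue(shared_key, 0x400, 7, 0x3 /* rw */, 600);

    printf("write allowed: %s\n", verify(&c, shared_key, 0x2) ? "yes" : "no");
    return 0;
}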
security-identity | |
Summary | Lustre identity is a framework for Lustre file operation authorization. Generally, it can be divided into two parts:
|
Code |
lustre/llite/llite_rmtacl.c lustre/mdt/mdt_identity.c lustre/mdt/mdt_idmap.c lustre/mdt/mdt_lib.c lustre/obdclass/idmap.c lustre/utils/l_getidentity.c lustre/include/lustre_idmap.h lustre/llite/xattr.c lustre/mdt/mdt_xattr.c lustre/cmm/cmm_object.c lustre/cmm/mdc_object.c lustre/mdd/mdd_permission.c lustre/mdd/mdd_object.c lustre/mdd/mdd_dir.c lustre/obdclass/acl.c lustre/include/lustre_eacl.h |
OST | |
Summary | OST is a very thin layer of the data server. Its main responsibility is to translate RPCs into local calls to obdfilter, i.e. RPC parsing. |
Code |
lustre/ost/*.[ch] |
ldiskfs | |
Summary | ldiskfs is a local disk filesystem built on top of ext3. It adds extents support, a multiblock allocator, multi-mount protection, and the iopen feature to ext3. |
Code |
There is no ldiskfs code in CVS. Instead, the ext3 code is copied from the kernel, the patches are applied, and then the whole thing is renamed to ldiskfs. For details, see ldiskfs/. |
fsfilt | |
Summary | The fsfilt layer abstracts the backing filesystem specifics away from the obdfilter and MDS code in Lustre 1.4 and 1.6. This avoids linking the obdfilter and MDS directly against the filesystem module and in theory allows different backing filesystems, but in practice this was never implemented. In Lustre 1.8 and later this code is replaced by the OSD layer.
There is a core fsfilt module which can auto-load the backing filesystem type based on the type specified during configuration. This loads a filesystem-specific fsfilt_{fstype} module with a set of methods for that filesystem. There are a number of different kinds of methods:
|
Code |
The files used for the fsfilt code reside in: lustre/lvfs/fsfilt.c - interface used by obdfilter/MDS, module autoloading lustre/lvfs/fsfilt_ext3.c - interface to ext3/ldiskfs filesystem The fsfilt_ldiskfs.c file is auto-generated from fsfilt_ext3.c in lustre/lvfs/autoMakefile.am using sed to replace instances of ext3 and EXT3 with ldiskfs, and a few other replacements to avoid symbol clashes. |
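The auto-loading described above follows a familiar ops-table pattern: each backing filesystem registers a named table of methods, and the core code looks the table up by type name at configuration time (in the kernel, failing to find it is the point where the fsfilt_{fstype} module would be loaded). The sketch below shows the pattern with illustrative names and a trivial "ldiskfs" implementation; it is not the real fsfilt_operations definition.

/* Ops-table pattern: a backing filesystem registers a named method table,
 * the core looks it up by type name at configuration time.  Illustrative. */
#include <stdio.h>
#include <string.h>

struct fsfilt_ops {
    const char *fs_type;
    void *(*start_transaction)(void *inode, int op);
    int   (*commit_transaction)(void *handle);
    int   (*setattr)(void *inode, void *attrs);
};

static const struct fsfilt_ops *registered[8];
static int nr_registered;

static void fsfilt_register(const struct fsfilt_ops *ops)
{
    registered[nr_registered++] = ops;
}

static const struct fsfilt_ops *fsfilt_get_ops(const char *fs_type)
{
    for (int i = 0; i < nr_registered; i++)
        if (strcmp(registered[i]->fs_type, fs_type) == 0)
            return registered[i];
    return NULL;    /* kernel code would try to load the module here */
}

/* A trivial "ldiskfs" implementation for the sake of the example. */
static void *ldiskfs_start(void *inode, int op)  { (void)inode; (void)op; return "handle"; }
static int   ldiskfs_commit(void *handle)        { (void)handle; return 0; }
static int   ldiskfs_setattr(void *i, void *a)   { (void)i; (void)a; return 0; }

static const struct fsfilt_ops ldiskfs_ops = {
    "ldiskfs", ldiskfs_start, ldiskfs_commit, ldiskfs_setattr,
};

int main(void)
{
    fsfilt_register(&ldiskfs_ops);

    const struct fsfilt_ops *ops = fsfilt_get_ops("ldiskfs");
    printf("found methods for %s\n", ops ? ops->fs_type : "(none)");
    return 0;
}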
ldiskfs OSD | |
Summary | ldiskfs-OSD is an implementation of the dt_{device,object} interfaces on top of the (modified) ldiskfs filesystem.
It uses standard ldiskfs/ext3 code to do file I/O. It supports 2 types of indices (in the same filesystem):
ldiskfs-OSD uses a read-write mutex to serialize compound operations. |
Code |
lustre/include/dt_object.h lustre/osd/osd_internal.h lustre/osd/osd_handler.c |
DMU OSD | |
Summary | This is another implementation of the OSD API for userspace DMU. It uses DMU's ZAP for indices. |
Code |
dmu-osd/*.[ch] in b_hd_dmu branch |
DMU | |
Summary | The DMU is one of the layers in Sun's ZFS filesystem which is responsible for presenting a transactional object store to its consumers. It is used as Lustre's backend object storage mechanism for the userspace MDSs and OSSs.
The ZFS community page has a source tour (ZFS source) which is useful as an introduction to the several ZFS layers. There are many useful resources on that community page. For reference, here's a list of DMU features:
|
Code |
src/ -> source code src/cmd/ -> ZFS/DMU related programs src/cmd/lzfs/ -> lzfs, the filesystem administration utility src/cmd/lzpool/ -> lzpool, the pool administration utility src/cmd/lzdb/ -> lzdb, the zfs debugger src/cmd/lztest/ -> lztest, the DMU test suite src/cmd/lzfsd/ -> lzfsd, the ZFS daemon src/lib/ -> Libraries src/lib/port/ -> Portability layer src/lib/solcompat/ -> Solaris -> Linux portability layer (deprecated, use libport instead) src/lib/avl/ -> AVL trees, used in many places in the DMU code src/lib/nvpair/ -> Name-value pairs, used in many places in the DMU code src/lib/umem/ -> Memory management library src/lib/zpool/ -> Main ZFS/DMU code src/lib/zfs/ -> ZFS library used by the lzfs and lzpool utilities src/lib/zfscommon/ -> Common ZFS code between libzpool and libzfs src/lib/ctl/ -> Userspace control/management interface src/lib/udmu/ -> Lustre uDMU code (thin library around the DMU) src/scons/ -> local copy of SCons tests/regression/ -> Regression tests. misc/ -> miscellaneous files/scripts |
obdfilter | |
Summary | obdfilter is a core component of the OST (data server), making the underlying disk filesystem part of the distributed system:
|
Code |
lustre/obdfilter/*.[ch] |
MDS | |
Summary | The MDS service in Lustre 1.4 and 1.6 is a monolithic body of code that provides multiple functions related to filesystem metadata. It handles the incoming RPCs and service threads for metadata operations (create, rename, unlink, readdir, etc), interfaces with the Lustre lock manager (DLM), and also manages the underlying filesystem (via the fsfilt interface).
The MDS is the primary point of access control for clients; it allocates the objects belonging to a file (in conjunction with the LOV) and passes that information to the clients when they access the file. The MDS is also ultimately responsible for deleting objects on the OSTs, either by passing the object information to the client that removes the last link or open reference on a file and having the client destroy the objects, or by destroying the objects on the OSTs itself in case the client fails to do so. In the 1.8 and later releases, the functionality provided by the MDS code has been split into multiple parts (MDT, MDD, OSD) in order to allow stacking of the metadata devices for clustered metadata. |
Code |
lustre/mds/commit_confd.c lustre/mds/handler.c - RPC request handler lustre/mds/lproc_mds.c - /proc interface for stats/control lustre/mds/mds_fs.c - Mount/configuration of underlying filesystem lustre/mds/mds_internal.h - Header for internal declarations lustre/mds/mds_join.c - Handle join_file operations lustre/mds/mds_lib.c - Unpack of wire structs from requests lustre/mds/mds_log.c - Lustre log interface (llog) for unlink/setattr lustre/mds/mds_lov.c - Interface to LOV for create and orphan lustre/mds/mds_open.c - File open/close handling lustre/mds/mds_reint.c - Reintegration of changes made by clients lustre/mds/mds_unlink_open.c - Handling of open-unlinked files (PENDING dir) lustre/mds/mds_xattr.c - User-extended attribute handling |
MDT | |
Summary | MDT stands for MetaData Target. This is the top-most layer in the MD server device stack. The MDT is responsible for everything networking-related as far as metadata are concerned:
Theoretically, MDT is an optional layer: a completely local Lustre setup, with a single metadata server and a locally mounted client, could exist without MDT (and still use networking for non-metadata access). |
Code |
lustre/mdt/mdt.mod.c lustre/mdt/mdt_capa.c lustre/mdt/mdt_handler.c lustre/mdt/mdt_identity.c lustre/mdt/mdt_idmap.c lustre/mdt/mdt_internal.h lustre/mdt/mdt_lib.c lustre/mdt/mdt_lproc.c lustre/mdt/mdt_open.c lustre/mdt/mdt_recovery.c lustre/mdt/mdt_reint.c lustre/mdt/mdt_xattr.c |
CMM | |
Summary | Overview
The CMM is a new layer in the MDS which takes care of all clustered metadata issues and relationships. The CMM does the following:
CMM functionality: The CMM chooses all servers involved in an operation and sends dependent requests if needed. Calling a remote MDS is a new feature related to CMD. The CMM maintains the list of MDCs used to connect with all other MDSes. Objects: The CMM can allocate two types of objects - local and remote. A remote object can occur during metadata operations with more than one object involved; such an operation is called a cross-ref operation. |
Code |
lustre/cmm |
MDD | |
Summary | MDD is the metadata layer in the new MDS stack; it is the only layer that operates on metadata in the MDS. The implementation is similar to VFS metadata operations but is based on OSD storage. The MDD API is currently only used in the new MDS stack, called by the CMM layer.
In theory, MDD should be a purely local metadata layer, but for compatibility with the old MDS stack and to reuse some MDS code (llog and lov), an mds device is created and connected to the mdd. The llog and lov code in mdd therefore still uses the original code through this temporary mds device; it will be removed when the new llog and lov layers in the new MDS stack are implemented. |
Code |
lustre/lustre/mdd/ |
recovery | |
Summary |
Overview: Client recovery starts when no server reply is received within a given timeout, or when the server tells the client that it is not connected (the client was evicted on the server earlier for whatever reason). Recovery consists of trying to connect to the server and then stepping through several recovery states during which various client-server state is synchronized, namely all requests that were already sent to the server but not yet confirmed as received, and DLM locks. Should any problem arise during the recovery process (be it a timeout or the server's refusal to recognise the client again), recovery is restarted from the very beginning. During recovery, new requests are not sent to the server; instead they are added to a special delayed-requests queue that is sent once recovery completes successfully. Replay and Resend
|
Code |
Recovery code is scattered through almost all of the code. The most important pieces are: ldlm/ldlm_lib.c - generic server recovery code ptlrpc/ - client recovery code |
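The request synchronization mentioned above relies on replay bookkeeping: the client keeps every request it has sent until the server's replies indicate (via the last-committed transaction number) that its effects are on stable storage, and after a reconnect everything still uncovered is replayed in order. The sketch below models just that bookkeeping with illustrative names; the real code is spread across ptlrpc/.

/* Sketch of replay bookkeeping: sent requests are kept until the server
 * reports them committed; uncommitted ones are replayed after reconnect. */
#include <stdint.h>
#include <stdio.h>

#define MAX_REQS 16

struct sent_request {
    uint64_t transno;      /* transaction number assigned by the server */
    const char *name;
};

struct replay_list {
    struct sent_request reqs[MAX_REQS];
    int nr;
};

/* Drop everything the server has durably committed. */
static void prune_committed(struct replay_list *rl, uint64_t last_committed)
{
    int kept = 0;
    for (int i = 0; i < rl->nr; i++)
        if (rl->reqs[i].transno > last_committed)
            rl->reqs[kept++] = rl->reqs[i];
    rl->nr = kept;
}

/* After reconnecting, everything still on the list is replayed in order. */
static void replay_all(const struct replay_list *rl)
{
    for (int i = 0; i < rl->nr; i++)
        printf("replaying transno %llu (%s)\n",
               (unsigned long long)rl->reqs[i].transno, rl->reqs[i].name);
}

int main(void)
{
    struct replay_list rl = {
        .reqs = { { 101, "mkdir" }, { 102, "create" }, { 103, "setattr" } },
        .nr = 3,
    };

    prune_committed(&rl, 101);  /* a reply said last_committed = 101 */
    replay_all(&rl);            /* 102 and 103 get replayed after reconnect */
    return 0;
}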
version recovery | |
Summary |
Version Based Recovery: This recovery technique is based on using versions of objects (inodes) to allow clients to recover later than the ordinary server recovery timeframe.
|
Code |
Recovery code is scattered through almost all of the code. The most important pieces are: ldlm/ldlm_lib.c - generic server recovery code ptlrpc/ - client recovery code |
IAM | |
Summary | IAM stands for 'Index Access Module': it is an extension to the ldiskfs directory code, adding generic indexing capability.
A file system directory can be thought of as an index mapping keys, which are strings (file names), to records, which are integers (inode numbers). IAM removes limitations on key and record size and format, providing the abstraction of a transactional container that maps arbitrary opaque keys to opaque records. Implementation notes:
IAM is used by ldiskfs-OSD to implement dt_index_operations interface. |
Code |
lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6-sles10.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-ops.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-rhel5.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-rhel4.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.18-vanilla.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-separate.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-2.6.9-rhel4.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-sles10.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-common.patch lustre/ldiskfs/kernel_patches/patches/ext3-iam-uapi.patch |
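The abstraction IAM provides - a container mapping opaque, fixed-size keys to opaque, fixed-size records - can be sketched with a flat in-memory table offering insert and lookup. In the real code the container is a tree-like structure stored in an ldiskfs file and every update runs inside a journal transaction; the names and the linear search below are purely illustrative.

/* Sketch of a transactional-container-like interface: opaque fixed-size
 * keys map to opaque fixed-size records.  Flat table for illustration. */
#include <stdio.h>
#include <string.h>

#define KEY_SIZE    16
#define REC_SIZE    32
#define MAX_ENTRIES 64

struct iam_like_container {
    int  count;
    unsigned char keys[MAX_ENTRIES][KEY_SIZE];
    unsigned char recs[MAX_ENTRIES][REC_SIZE];
};

static int index_insert(struct iam_like_container *c,
                        const void *key, const void *rec)
{
    if (c->count == MAX_ENTRIES)
        return -1;
    memcpy(c->keys[c->count], key, KEY_SIZE);
    memcpy(c->recs[c->count], rec, REC_SIZE);
    c->count++;
    return 0;
}

static const void *index_lookup(const struct iam_like_container *c,
                                const void *key)
{
    for (int i = 0; i < c->count; i++)
        if (memcmp(c->keys[i], key, KEY_SIZE) == 0)
            return c->recs[i];
    return NULL;
}

int main(void)
{
    static struct iam_like_container c;
    unsigned char key[KEY_SIZE] = "fid-as-key";
    unsigned char rec[REC_SIZE] = "inode-or-record-payload";

    index_insert(&c, key, rec);
    printf("lookup: %s\n", (const char *)index_lookup(&c, key));
    return 0;
}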
SOM | |
Summary | Size-on-MDS (SOM) is a metadata improvement which caches the inode size, blocks, ctime and mtime on the MDS. Such attribute caching allows clients to avoid making RPCs to the OSTs to obtain the attributes encoded in the file objects kept on those OSTs, which results in significantly improved performance when listing directories. |
Code |
llite/llite_close.c -- client side SOM code liblustre/file.c -- liblustre SOM code mdt/mdt_handler.c -- general handling of SOM-related RPCs mdt/mdt_open.c -- MDS side SOM code mdt/mdt_recovery.c -- MDS side SOM recovery code obdfilter/filter_log.c -- OST side IO epoch logging code |