http://wiki.old.lustre.org/api.php?action=feedcontributions&user=Adilger&feedformat=atomObsolete Lustre Wiki - User contributions [en]2024-03-29T07:17:09ZUser contributionsMediaWiki 1.35.5http://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=13262Lustre Project List2013-05-09T16:25:14Z<p>Adilger: Remove obsolete projects</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre, and that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; for each one, either use equivalent functionality that already exists in the kernel, or work to get the patch accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217] [http://jira.whamcloud.com/browse/LU-18 LU-18] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Allow default OST pool<br />
|4<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24314 24314]<br />
|<small>Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, and index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for new files in the filesystem when no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement a distributed snapshot mechanism, initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or by blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely, so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. quotas to limit or deny access by specific users to certain pools, including the default "all OSTs" pool) and more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=GSS_/_Kerberos&diff=13218GSS / Kerberos2013-04-12T16:26:46Z<p>Adilger: Undo revision 12284 by Delphia Mulcahey (talk)</p>
<hr />
<div>'''Note:''' Only the HEAD branch supports GSS/Kerberos functionality. It is subject to changes at any time, and backward compatibility is NOT guaranteed.<br />
<br />
= Kerberos Lustre Setup =<br />
<br />
== Security Flavor ==<br />
A security flavor is a string that describes what kind of authentication and data transformation is performed on a PTLRPC connection. It covers both RPC messages and bulk data.<br />
<br />
The supported flavors are described in the following table:<br />
<br />
{|border=1 cellspacing=0<br />
|bgcolor=#E6E6E6| Base Flavor||bgcolor= #E6E6E6|Authentication||bgcolor=#E6E6E6|RPC Message Protection||bgcolor=#E6E6E6|Bulk Data Protection||bgcolor=#E6E6E6|Notes<br />
|-<br />
|'''''null'''''||N/A ||N/A ||N/A '''[*]''' ||Almost no performance overhead. The on-wire rpc format is compatible with old versions (1.4.x, 1.6.x, 1.8.x).<br />
|-<br />
|'''''plain'''''||N/A ||N/A ||checksum||(obsolete)<br />
|-<br />
|'''''krb5n'''''||GSS/Kerberos5 ||null||checksum (adler32)||No protection of rpc message, adler32 checksum protection of bulk data, light performance overhead.<br />
|-<br />
|'''''krb5a'''''||GSS/Kerberos5 ||partial integrity (krb5)||checksum (adler32)||Only the header of the rpc message is integrity protected, adler32 checksum protection of bulk data, more performance overhead compared to krb5n. <br />
|-<br />
|'''''krb5i'''''||GSS/Kerberos5 ||integrity (krb5)||integrity (krb5)||The transformation algorithm is determined by the actual Kerberos algorithms in use; heavy performance penalty. <br />
|-<br />
|'''''krb5p'''''||GSS/Kerberos5 ||privacy (krb5)||privacy (krb5)||The privacy protection algorithm is determined by the actual Kerberos algorithms in use; the heaviest performance penalty.<br />
|}<br />
<br />
'''[*]''' In Lustre 1.4 and 1.6 it is possible to enable bulk data checksumming to provide integrity checking using CRC32. In 1.6.5 this is expected to be the default behaviour, using the Adler32 mechanism by default (lower CPU overhead than CRC32).<br />
<br />
In the future, we may want to support customized flavors to some extent, for example allowing different flavors to be set for RPC messages and bulk data.<br />
<br />
== Kerberos Setup ==<br />
=== Distribution ===<br />
* We only support MIT Kerberos 5, versions 1.3.x through the latest 1.6.x.<br />
<br />
=== Configuration ===<br />
1. Configure client nodes:<br />
*For each client node, create a lustre_root principal and generate keytab.<br />
kadmin> addprinc -randkey lustre_root/client_host.domain@REALM<br />
kadmin> ktadd -e aes128-cts:normal lustre_root/client_host.domain@REALM <br />
*Install the keytab on the client node.<br />
<br />
2. Configure MDS node:<br />
*For each MDS node, create a lustre_mds principal and generate keytab.<br />
kadmin> addprinc -randkey lustre_mds/mds_host.domain@REALM<br />
kadmin> ktadd -e aes128-cts:normal lustre_mds/mds_host.domain@REALM<br />
*Install the keytab on the MDS node.<br />
<br />
3. Configure OSS node:<br />
*For each OSS node, create a lustre_oss principal and generate keytab.<br />
kadmin> addprinc -randkey lustre_oss/oss_host.domain@REALM<br />
kadmin> ktadd -e aes128-cts:normal lustre_oss/oss_host.domain@REALM<br />
*Install the keytab on the OSS node.<br />
<br />
NOTES:<br />
*The ''host.domain'' should be the FQDN in your network, otherwise the server might not recognize any GSS request.<br />
<br />
*As an alternative to per-client keytabs, if you want to save the trouble of assigning a unique keytab to each client node, you can create a general lustre_root principal and its keytab, and install the same keytab on as many client nodes as you want. '''But be aware that in this way one compromised client means all clients are insecure'''.<br />
kadmin> addprinc -randkey lustre_root@REALM<br />
kadmin> ktadd -e aes128-cts:normal lustre_root@REALM<br />
<br />
*To merge keytab files, you need the tool '''''ktutil'''''; for more details please refer to the ktutil manual.<br />
<br />
*Lustre supports the following ''enctypes'' for MIT Kerberos 5 version 1.4 or higher:<br />
**<u>''des-cbc-md5''</u><br />
**<u>''des3-hmac-sha1''</u><br />
**<u>''aes128-cts''</u><br />
**<u>''aes256-cts''</u><br />
<br />
*For MIT Kerberos 1.3.x, only ''des-cbc-md5'' works because of a known issue between libgssapi and the Kerberos library.<br />
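The per-node steps above all follow one pattern, so the kadmin commands can be generated instead of typed by hand. The sketch below only prints the commands and never contacts the KDC. The function name <code>lustre_princ_cmds</code> and the example host and realm names are illustrative assumptions; the principal names and the <code>aes128-cts:normal</code> enctype are taken from the steps above.

```shell
#!/bin/sh
# Print the kadmin commands for one node (nothing is executed).
# usage: lustre_princ_cmds <client|mds|oss> <fqdn> <REALM>
lustre_princ_cmds() {
    role=$1 host=$2 realm=$3
    case $role in
        client) svc=lustre_root ;;
        mds)    svc=lustre_mds ;;
        oss)    svc=lustre_oss ;;
        *)      echo "unknown role: $role" >&2; return 1 ;;
    esac
    princ="$svc/$host@$realm"
    echo "addprinc -randkey $princ"
    echo "ktadd -e aes128-cts:normal $princ"
}

# Hypothetical host and realm, for illustration only.
lustre_princ_cmds mds mds1.example.com EXAMPLE.COM
```

The printed lines can be reviewed and then entered at the <code>kadmin></code> prompt.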
<br />
== Required packages ==<br />
Every node should have the following packages installed:<br />
* '''''libgssapi''''' version 0.10 or higher. Some newer Linux distributions already come with it. If not, build & install from source: http://www.citi.umich.edu/projects/nfsv4/linux/libgssapi/libgssapi-0.11.tar.gz<br />
* '''''keyutils'''''<br />
<br />
== Kernel & Environment ==<br />
* System wide configuration:<br />
On each node (MDT, OST, client) the following line should be added to /etc/fstab so that it is mounted automatically:<br />
nfsd /proc/fs/nfsd nfsd defaults 0 0 <br />
On each MDT and client node, add the following line to /etc/request-key.conf:<br />
create lgssc * * /usr/sbin/lgss_keyring %o %k %t %d %c %u %g %T %P %S<br />
Note that you might need to replace '''/usr/sbin/lgss_keyring''' in the above line with the actual path to the lgss_keyring binary on your system.<br />
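Because these edits may be re-run when a node is reprovisioned, it can help to make them idempotent. The following is a minimal sketch: <code>ensure_line</code> is a hypothetical helper, not part of Lustre, and the example writes to a scratch file rather than to the real /etc/fstab.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
ensure_line() {
    file=$1 line=$2
    # -x: match the whole line, -F: treat the text literally (no regex)
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

tmp=$(mktemp)
ensure_line "$tmp" 'nfsd /proc/fs/nfsd nfsd defaults 0 0'
ensure_line "$tmp" 'nfsd /proc/fs/nfsd nfsd defaults 0 0'   # second call adds nothing
cat "$tmp"
rm -f "$tmp"
```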
<br />
* Networking:<br />
If you are using a network which is '''NOT''' TCP or Infiniband (e.g. Quadrics Elan, Myrinet, etc), you need to configure '''''/etc/lustre/nid2hostname''''' on '''each''' server node (MDT & OST). This is a simple script that translates a NID into a hostname. The following is a sample for an Elan cluster:<br />
<br />
#!/bin/bash<br />
set -x<br />
exec 2>/tmp/$(basename $0).debug<br />
<br />
# convert a NID for a LND to a hostname, for GSS for example<br />
<br />
# called with three arguments: lnd netid nid<br />
# $lnd will be a string such as "QSWLND", "GMLND", etc.<br />
# $netid will be a number in hex string format, like "0x16", etc.<br />
# $nid has the same format as $netid<br />
# output the corresponding hostname, or an error message prefixed by '@' for error logging.<br />
<br />
lnd=$1<br />
netid=$2<br />
nid=$3<br />
<br />
# uppercase the hex<br />
nid=$(echo $nid | tr '[abcdef]' '[ABCDEF]')<br />
# and convert to decimal<br />
nid=$(echo -e "ibase=16\n${nid/#0x}" | bc)<br />
case $lnd in<br />
QSWLND) # simply stick "mtn" on the front<br />
echo "mtn$nid"<br />
;;<br />
*) echo "@unknown LND: $lnd"<br />
;;<br />
esac<br />
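For reference, the hex-to-decimal step in the sample script can also be done without <code>bc</code>. This sketch assumes a shell whose <code>printf</code> accepts C-style hexadecimal constants (bash and dash both do). <code>nid_to_host</code> is a hypothetical name, and the <code>mtn</code> prefix is simply the Elan naming used in the sample above.

```shell
#!/bin/sh
# Same three arguments as the sample script: lnd netid nid.
# Prints the hostname, or an error message prefixed by '@'.
nid_to_host() {
    lnd=$1 nid=$3
    case $lnd in
        QSWLND) printf 'mtn%d\n' "$nid" ;;      # %d converts "0x16" to 22
        *)      printf '@unknown LND: %s\n' "$lnd" ;;
    esac
}

nid_to_host QSWLND 0x0 0x16   # prints "mtn22"
```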
<br />
== Build Lustre ==<br />
Enable GSS during configuration:<br />
<br />
./configure --enable-gss --other-options<br />
<br />
== Running ==<br />
=== GSS Daemons ===<br />
Make sure to start the daemon process '''lsvcgssd''' on each OST and MDT node before starting Lustre. The command syntax is:<br />
lsvcgssd [-f] [-v]<br />
* ''-f'': run in the foreground instead of as a daemon, so that error/warning messages go to the console instead of the system log.<br />
* ''-v'': increase verbosity by 1. The default is 0, maximum is 4.<br />
<br />
=== Setting Security Flavors ===<br />
Note: If nothing is specified, all RPC connections use '''''null''''' by default.<br />
<br />
The MGS holds a persistent sptlrpc rule database; by specifying a set of rules you can change the security flavors used between nodes. A rule has the form:<br />
<spec>=<flavor><br />
Rules are manipulated on the MGS node. To add a rule:<br />
mgs> lctl conf_param <spec>=<flavor><br />
If a rule already exists for <spec>, it will be overwritten.<br />
<br />
To delete a rule:<br />
mgs> lctl conf_param -d <spec><br />
<br />
The current rule set can be obtained with:<br />
mgs> cat /proc/fs/lustre/mgs/<mgs-name>/live/<fs-name> | grep "srpc.flavor"<br />
<br />
'''Note''':<br />
* Rules have persistent storage on the MGS, so they apply across re-mounts.<br />
* It doesn't matter in which order you add a set of rules; Lustre keeps the rules in a certain order of priority.<br />
* After you change a rule, it usually takes the system up to 1 minute to apply the new rule to all nodes, depending on system load.<br />
* Before you change a rule, make sure the affected nodes are ready for the new security flavor. E.g. if you change the flavor from '''''null''''' to '''''krb5p''''' but the GSS/Kerberos environment is not properly configured on the affected nodes, those nodes might be evicted because they can't communicate with the others.<br />
* You can also specify rules via on-disk device parameters, using mkfs.lustre or tunefs.lustre. The syntax is the same, and the rule applies only to connections to this specific target (MDT/OST).<br />
<br />
=== Rules Syntax & Examples ===<br />
The general syntax is:<br />
<target>.srpc.flavor.<network>[.<direction>]=<flavor><br />
<br />
* <target>: could be filesystem name, or specific MDT/OST device name. For example, ''lustre'', ''lustre-MDT0000'', ''lustre-OST0001'', etc.<br />
* <network>: LNET network name of the RPC initiator. For example, ''tcp0'', ''elan1'', ''o2ib0''.<br />
* <direction>: could be one of ''cli2mdt'', ''cli2ost'', ''mdt2mdt'', ''mdt2ost''. In most cases you don't need to specify <direction> part.<br />
<br />
Examples:<br />
* Apply ''krb5i'' on '''ALL''' connections:<br />
mgs> lctl conf_param lustre.srpc.flavor.default=krb5i<br />
<br />
* Nodes in network ''tcp0'' use ''krb5p''; All other nodes use ''null''<br />
mgs> lctl conf_param lustre.srpc.flavor.tcp0=krb5p<br />
mgs> lctl conf_param lustre.srpc.flavor.default=null<br />
<br />
* Nodes in network ''tcp0'' use ''krb5p''; nodes in ''elan1'' use ''plain''; among other nodes, clients use ''krb5i'' to MDT/OST, MDTs use ''null'' to other MDTs, and MDTs use ''plain'' to OSTs.<br />
mgs> lctl conf_param lustre.srpc.flavor.tcp0=krb5p<br />
mgs> lctl conf_param lustre.srpc.flavor.elan1=plain<br />
mgs> lctl conf_param lustre.srpc.flavor.default.cli2mdt=krb5i<br />
mgs> lctl conf_param lustre.srpc.flavor.default.cli2ost=krb5i<br />
mgs> lctl conf_param lustre.srpc.flavor.default.mdt2mdt=null<br />
mgs> lctl conf_param lustre.srpc.flavor.default.mdt2ost=plain<br />
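Since a mistyped rule can cut nodes off from each other, it may be worth checking a rule string before passing it to <code>lctl conf_param</code>. The sketch below splits a rule into its fields using only the syntax described above. <code>parse_srpc_rule</code> is a hypothetical helper, not a Lustre tool, and it accepts only the base flavors listed in the table earlier on this page, which is an assumption about what a site wants to allow.

```shell
#!/bin/sh
# Split "<target>.srpc.flavor.<network>[.<direction>]=<flavor>" into fields.
# Prints: target network direction flavor ("-" when no direction is given).
parse_srpc_rule() {
    rule=$1
    spec=${rule%%=*}
    flavor=${rule#*=}
    [ "$spec" != "$rule" ] || { echo "missing '=': $rule" >&2; return 1; }
    target=${spec%%.srpc.flavor.*}
    [ "$target" != "$spec" ] || { echo "missing .srpc.flavor.: $rule" >&2; return 1; }
    rest=${spec#*.srpc.flavor.}
    case $rest in
        *.*) network=${rest%%.*}; direction=${rest#*.} ;;
        *)   network=$rest;       direction=- ;;
    esac
    case $direction in
        -|cli2mdt|cli2ost|mdt2mdt|mdt2ost) ;;
        *) echo "bad direction: $direction" >&2; return 1 ;;
    esac
    case $flavor in
        null|plain|krb5n|krb5a|krb5i|krb5p) ;;
        *) echo "bad flavor: $flavor" >&2; return 1 ;;
    esac
    echo "$target $network $direction $flavor"
}

parse_srpc_rule lustre.srpc.flavor.default.cli2mdt=krb5i
# prints "lustre default cli2mdt krb5i"
```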
<br />
=== Authenticate Normal Users ===<br />
On client nodes, a non-root user needs to run '''''kinit''''' before accessing Lustre, just as with other Kerberized applications.<br />
* As required by Kerberos, the user's principal (''username@REALM'') should be added to the KDC.<br />
* Client and MDT nodes should have the same user database, i.e. the same user name and uid/gid translation.<br />
A user can destroy the established security contexts before logout with "lfs flushctx":<br />
<br />
lfs flushctx [-k]<br />
<br />
Here "-k" means also destroy the on-disk Kerberos credential cache, equivalent to "kdestroy"; otherwise only the established contexts in Lustre kernel memory are destroyed.<br />
<br />
== Secure MGC - MGS connection ==<br />
Each node can specify which flavor to use to connect to the MGS, with the option '''''mgssec=flavor''''' when mounting a target device or client. By default ''null'' is chosen. Once a flavor is chosen, it can't be changed until umount.<br />
<br />
Because each node has only one connection to the MGS, if there is more than one target device or client on a single node, all the "mgssec=" specifications must be the same. Simply omitting the "mgssec=" option means "use the currently chosen flavor".<br />
<br />
By default, the MGS accepts RPCs with any flavor, but the sysadmin can configure the MGS to accept only certain flavors from certain networks. The syntax is similar, but with the special target "_mgs":<br />
mgs> lctl conf_param _mgs.srpc.flavor.<network>=<flavor><br />
'''NOTE:''' applying an inappropriate flavor may leave a node unable to communicate with the MGS until restart, so use it carefully.<br />
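The "one MGS connection per node" rule can be checked before mounting. The sketch below encodes one reading of the text above: the first mount fixes the flavor (''null'' if the option is omitted), and every later mount must either omit the option or name the same flavor. <code>effective_mgssec</code> is a hypothetical helper, and "-" stands for a mount without "mgssec=".

```shell
#!/bin/sh
# Arguments: the mgssec= value of each mount on one node, in mount order
# ("-" when the option is omitted). Prints the flavor of the node's single
# MGS connection, or fails if a later mount names a different flavor.
effective_mgssec() {
    chosen=
    for f in "$@"; do
        if [ -z "$chosen" ]; then
            # first mount: an omitted option selects the default, null
            if [ "$f" = "-" ]; then chosen=null; else chosen=$f; fi
        elif [ "$f" != "-" ] && [ "$f" != "$chosen" ]; then
            echo "conflicting mgssec: $chosen vs $f" >&2
            return 1
        fi
    done
    echo "${chosen:-null}"
}

effective_mgssec krb5p - krb5p   # prints "krb5p"
```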
<br />
== Cross-Realms Authentication ==<br />
Because idmap functionality is missing, cross-realm authentication is not currently supported.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Architecture_-_Interoperability_fids_zfs&diff=12636Architecture - Interoperability fids zfs2012-10-04T07:19:58Z<p>Adilger: /* NEW.0 */ Fix LLOG and ECHO group numbers error to match FID_SEQ_LLOG/FID_SEQ_ECHO</p>
<hr />
<div>'''''Note:''''' ''The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain both outdated information and unimplemented functionality.'' <br />
<br />
== Summary ==<br />
<br />
This document describes an architecture for client, server, network, and storage interoperability during migration from 1.6-based, fidless Lustre clusters, using ldiskfs as a back-end file system, to clusters based on fids and the zfs file system.<br />
<br />
== Definitions ==<br />
<br />
As release numbers and numbering schemas are in flux, the description below uses symbolic names for various important points in Lustre development.<br />
<br />
; '''OLD''' : any major release in b1_6 line of development. This might end up being 1.6.something, or 1.7.<br />
; '''OLD.x''' : a release in b1_6 line containing client that is able to interact with a NEW.0 md server. (Tentatively 1.8.)<br />
; '''NEW.0''' : first release based on HEAD. This features kernel server, and uses ldiskfs as a back-end. This is (tentatively) 2.0. It is important to note that NEW.0 is a temporary intermediate release whose purpose is to effect transition from ldiskfs-based to DMU-based clusters.<br />
; '''NEW.1''' : next release based on HEAD. This release introduces support for fids on OST, and DMU as a back-end, in addition to continued support for ldiskfs. This is (tentatively) 2.x.<br />
; '''OLD protocol''' : b1_6 wire network protocol.<br />
; '''NEW protocol''' : wire protocol using fids for object identification.<br />
; '''OLD storage''', '''OLD file system''' : back-end file system of type ldiskfs.<br />
; '''DMU storage''' : back-end file system implemented through DMU.<br />
; '''fill-in-fid''' : a special, otherwise unused fid value, reserved to indicate in a CREATE RPC that the client requests the server to generate a fid for the newly created object on the client's behalf. This fid is taken from one of the system-reserved fid sequences.<br />
<br />
== Requirements ==<br />
<br />
; '''+-1 rule''' : adhere to the Lustre promise of maintaining interoperability one release back and forth.<br />
; '''downgrade''' : users are able to abandon upgrade and return back to the old cluster configuration up to a well-defined point of no-return when a decision is made to proceed forward. After that point downgrade is possible, on a condition that (potentially) all file system modifications made after no-return are lost.<br />
; '''rolling upgrade''' : an upgrade (and downgrade) is performed in a piecemeal fashion, node by node.<br />
; '''continuity''' : where possible, upgrade and downgrade do not disrupt ongoing operations. Client upgrade or downgrade obviously requires a client remount. Server upgrade and downgrade look like a server fail-over, with client operations continuing.<br />
; '''no stop-the-world''' : the migration path cannot require the whole cluster to be stopped for a prolonged amount of time (e.g., to migrate all data to the new format).<br />
<br />
== Compatibility matrix ==<br />
<br />
{| border=1 cellspacing=0 cellpadding="5"<br />
|-<br />
! || colspan=3|OLD || colspan=3|OLD.x || colspan=3|NEW.0 ||colspan=3|NEW.1<br />
|-<br />
! || C || O || M || C || O || M || C || O || M || C || O || M <br />
|-<br />
! OLD protocol <br />
| X || X || X || X || X || X || - || X || - || - || - || -<br />
|-<br />
! NEW protocol <br />
| - || - || - || X || - || - || X || - || X || X || X || X<br />
|-<br />
! OLD storage <br />
| bgcolor=707070| || X || X || bgcolor=707070| || X || X || bgcolor=707070| || X || X || bgcolor=707070| || - || -<br />
|-<br />
! DMU storage <br />
| bgcolor=707070| || - || - ||bgcolor=707070| || - || - ||bgcolor=707070| || - || - ||bgcolor=707070| || X || X<br />
|}<br />
<br />
Legend<br />
<br />
; '''C''' : client<br />
; '''O''' : OSS<br />
; '''M''' : MDT<br />
; '''X''' : given version supports given format or protocol<br />
; '''-''' : given version does not support given format or protocol<br />
; gray area : impossible combination<br />
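The protocol rows of the matrix can be read as a lookup: two components can talk only if they share at least one protocol. The sketch below transcribes those two rows as they appear above and intersects them. The function names <code>protocols</code> and <code>common_protocol</code> are illustrative assumptions.

```shell
#!/bin/sh
# Protocols spoken by a component, transcribed from the matrix above.
# usage: protocols <version> <C|O|M>   -> prints "OLD", "NEW", or "OLD NEW"
protocols() {
    case "$1:$2" in
        OLD:C|OLD:O|OLD:M)       echo "OLD" ;;
        OLD.x:C)                 echo "OLD NEW" ;;
        OLD.x:O|OLD.x:M)         echo "OLD" ;;
        NEW.0:C|NEW.0:M)         echo "NEW" ;;
        NEW.0:O)                 echo "OLD" ;;
        NEW.1:C|NEW.1:O|NEW.1:M) echo "NEW" ;;
        *) echo "unknown: $1 $2" >&2; return 1 ;;
    esac
}

# Print the protocols two components have in common; fail if there are none.
# usage: common_protocol <version> <C|O|M> <version> <C|O|M>
common_protocol() {
    a=$(protocols "$1" "$2") || return 1
    b=$(protocols "$3" "$4") || return 1
    found=
    for p in $a; do
        case " $b " in *" $p "*) found="$found $p" ;; esac
    done
    [ -n "$found" ] || return 1
    echo $found    # unquoted on purpose, to drop the leading space
}

common_protocol OLD.x C NEW.0 M   # prints "NEW"
```

An OLD client and a NEW.0 MDT share no protocol, so <code>common_protocol OLD C NEW.0 M</code> fails, which matches the requirement that clients move to OLD.x before the MDT is upgraded.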
<br />
== Migration path ==<br />
<br />
The following upgrade path is envisaged:<br />
<br />
* starting with OLD version installed on the cluster...<br />
* OLD.x release is installed, making clients upward compatible with NEW.0 MDT server. This step can be undone without loss of functionality or availability.<br />
* all clients are upgraded to OLD.x.<br />
* NEW.0 md server is installed, and original (OLD.x md server) is failed over to the former. Clients can continue without evictions. This step can be undone with the minor loss of availability (e.g., evictions during downgrade).<br />
* NEW.0 release is installed on client and OSS nodes. Client has to unmount and remount file system to continue with the new release. This step can be undone with the minor loss of availability (again, unmount followed by remount to revert back to the old release).<br />
* clients and OST's are upgraded to NEW.1 release. At that moment, no OLD code is running in the cluster, but all data and meta-data are still stored in the OLD format, except for the redundant information, like object index, and fids in EA, not used by the OLD server.<br />
* MDT fails over to NEW.1. On a reconnect, OST's switch to NEW protocol. At this moment, all networking traffic is in NEW protocol.<br />
* NEW.1 dmu based ost's are formatted and added to the cluster.<br />
* online migration of data starts. This step can be undone without loss of functionality or availability.<br />
* NEW.1 DMU mdt is formatted. Magic meta-data migration tool is invoked. '''?Q not clear yet. Downgrade?''' <br />
* once meta-data are migrated to the NEW.1, upgrade is complete.<br />
<br />
{| border=1 cellspacing=0 cellpadding="5"<br />
|-<br />
! Label || Client || OSS || MDT || Upgrade comment (read top-to-bottom) || Downgrade comments (read bottom-to-top)<br />
|-<br />
| all-old || OLD || OLD || OLD || original configuration ||rowspan="3"|downgrade of clients, OSS and MDT to OLD can be performed in any order<br />
|-<br />
| client-old.x || OLD.x || OLD || OLD ||rowspan="3"|upgrade of clients, OSS and MDT to OLD.x can be performed in any order<br />
|-<br />
| oss-old.x || OLD.x || OLD.x || OLD ||<br />
|-<br />
| all-old.x || OLD.x || OLD.x || OLD.x || MDT is failed over to OLD.x version. On reconnect clients and OSS servers recognize downgrade and switch to the OLD protocol.<br />
|-<br />
|mdt-new.0 || OLD.x ||OLD.x || NEW.0 || as new server is failed over to, OLD.x clients recognize this and start using NEW protocol to talk to MDT. OST still uses OLD protocol to talk to the MDT. ||rowspan="2"| clients are downgraded to OLD.x version in any order. They continue to speak NEW protocol. If SOM was activated during upgrade, no further downgrade is possible.<br />
|-<br />
|client-new.0 || NEW.0 ||OLD.x || NEW.0 || rowspan="2"|clients and OSSes are upgraded to NEW-protocol-only version in any order.<br />
|-<br />
|all-new.0 || NEW.0 ||NEW.0 || NEW.0 || SOM is de-activated on the MDT, if it was enabled.<br />
|-<br />
|new.0-som || NEW.0 ||NEW.0 || NEW.0 || (Optional) SOM is activated on the MDT. || all data are in OLD format.<br />
|-<br />
|client-new.1 || NEW.1 || NEW.1 || NEW.0 || Clients and OST's are upgraded to NEW.1 in any order. OST's continue to talk to the MDT using old protocol. || OST's migrate back to NEW.0<br />
|-<br />
|mdt.1 || NEW.1 || NEW.1 || NEW.1 || MDT fails over to NEW.1 version, and announces to OST's that it talks NEW protocol. OST's switch to NEW protocol on reconnect || MDT fails over to the NEW.0 version. OST's switch to the OLD protocol on reconnect.<br />
|-<br />
|data.dmu || NEW.1 || NEW.1 || NEW.1 || New DMU-based OST's are formatted and added to the cluster. Data migration starts. || ldiskfs-based NEW.1 OST's are added into cluster and data are migrated back to them.<br />
|-<br />
|all-data.dmu || NEW.1 ||NEW.1 || NEW.1 || all data are on DMU OSS servers.|| original configuration<br />
|-<br />
|colspan="6"|point-of-no-return.<br />
|-<br />
|all-dmu || NEW.0 ||NEW.1 || NEW.1 || meta-data is converted (offline?) to new DMU based MDT.|| downgrade is not possible from here.<br />
|}<br />
<br />
== Use Cases ==<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
!id !! quality attribute !! summary<br />
|-<br />
|old.x-client || usability || OLD.x client is introduced into otherwise OLD cluster.<br />
|-<br />
|mdt.upgrade.0 || usability, availability || OLD.x MDT fails over to NEW.0 MDT<br />
|-<br />
|mdt.upgrade.0.client ||availability || "...": client reconnection and recovery<br />
|-<br />
|new.1-ost || usability || NEW.1 OST is added to a cluster containing NEW.1 clients.<br />
|-<br />
|mdt.upgrade || usability, availability || NEW.0 MDT fails over to NEW.1 MDT<br />
|-<br />
|mdt.upgrade.1.ost ||availability || "...": OST reconnection and recovery<br />
|-<br />
|mdt.downgrade.0 || usability, availability || NEW.0 MDT fails over to OLD.x MDT.<br />
|-<br />
|mdt.downgrade.0.client ||availability || "...": client reconnection and recovery<br />
|-<br />
|mdt.downgrade.1 || usability, availability || NEW.1 MDT fails over to NEW.0 MDT.<br />
|-<br />
|mdt.downgrade.1.ost ||availability || "...": OST reconnection and recovery<br />
|}<br />
<br />
NEW.0 MDT handles...<br />
{| border=1 cellspacing=0<br />
|-<br />
!id !! quality attribute !! summary<br />
|-<br />
|mdt.lookup.old || correctness || LOOKUP for a file created by OLD MDT<br />
|-<br />
|mdt.lookup.new.0 || correctness || LOOKUP for a file created by NEW.0 MDT<br />
|-<br />
|mdt.create || correctness || CREATE with a fid supplied by a client<br />
|-<br />
|mdt.readdir || correctness || READDIR<br />
|}<br />
<br />
NEW.0 OST handles ...<br />
{| border=1 cellspacing=0<br />
|-<br />
!id !! quality attribute !! summary<br />
|-<br />
|ost.lookup.old || correctness || LOOKUP for a file created by OLD OST<br />
|-<br />
|ost.lookup.new.0 || correctness || LOOKUP for a file created by NEW.0 OST<br />
|-<br />
|ost.create || correctness || CREATE with a fid supplied by a client<br />
|-<br />
|ost.unlink || correctness || UNLINK<br />
|}<br />
<br />
== Quality Attribute Scenarios ==<br />
<br />
; '''old.x-client'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || OLD.x client is introduced into otherwise OLD cluster.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || permit rolling upgrade<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with OLD release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| lustre client<br />
|-align="left"<br />
|'''Response:'''|| OLD client unmounts, OLD.x release is installed on a cluster node. Client connects to the MDT, requesting OBD_CONNECT_FID, which is not granted. Client detects that it connected to the OLD MDT.<br />
|-align="left"<br />
|'''Response measure:'''|| client should be able to talk to the OLD MDT.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| <br />
|-<br />
|}<br />
<br />
; '''new.1-ost'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.1 OST is added to a cluster containing NEW.1 clients<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || permit rolling server upgrade<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability, availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| OST<br />
|-align="left"<br />
|'''Response:'''|| NEW.0 OST fails over to NEW.1 version. OST reconnects to MDT, requesting OBD_CONNECT_FID, which is not granted. OST detects that it connected to NEW.0 MDT, and clears OBD_CONNECT_FID bit in '''its''' supported connection flags mask, forcing all reconnecting clients into OLD mode.<br />
|-align="left"<br />
|'''Response measure:'''|| OST should be able to talk to the NEW.0 MDT and NEW.0 clients.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| <br />
|-<br />
|}<br />
<br />
; '''mdt.upgrade.0'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || OLD.x MDT fails over to NEW.0 MDT<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || upgrade to NEW.0 without downtime<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability, availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with OLD.x release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over MDT creates missing NEW.0 files (/oi, /fld, /seq, etc.), and starts recovery, accepting NEW-protocol connections from the clients, and OLD protocol connections from OSS servers. When receiving replay of a CREATE rpc with a fill-in-fid, MDT generates fid internally (using seq service), and returns it to client.<br />
|-align="left"<br />
|'''Response measure:'''|| Fail-over and recovery have to complete successfully<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| recovery, see following scenarios<br />
|-<br />
|}<br />
<br />
; '''mdt.upgrade.0.client'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || OLD.x MDT fails over to NEW.0 MDT, client reconnects and replays.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || successful recovery <br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of OLD.x and NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| client<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over client gets OBD_CONNECT_FID bit from MDT and detects that it now talks to NEW.0 MDT. It continues to use OLD protocol to talk to OST's. Client proceeds with recovery, converting requests into new format, and converting inode numbers in RPCs into fids. For CREATE RPCs, some otherwise impossible fill-in-fid (from system-reserved fid sequence) is used, to indicate that server has to generate fid. Client should be prepared for the server to overwrite the client-supplied fid in any CREATE rpc. There should be no need to rebuild any internal data structures (locks, inode table, pages, etc.) as all objects are identified by fids internally in OLD.x mode.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.upgrade.1.ost'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT fails over to NEW.1 MDT, OST reconnects and replays.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || successful recovery <br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| OST<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over OST gets OBD_CONNECT_FID bit from MDT and detects that it now talks to NEW.1 MDT. OST sets OBD_CONNECT_FID in its own supported connect bits mask. OST proceeds with MDT-OST recovery, converting requests into new format, and converting inode numbers in RPCs into fids.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.downgrade.0'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT fails over to OLD.x MDT<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of OLD.x and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, MDT starts OLD-protocol recovery, accepting connections in OLD protocol.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.downgrade.0.client'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT fails over to OLD.x MDT: client reconnection and recovery<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of OLD.x and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| client<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, client reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, client switches to OLD.x mode, and starts replay, converting RPCs to the OLD protocol. If client is unable to convert an RPC, because it doesn't know inode number corresponding to the fid, it evicts itself.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| Search for "KABOOM" on this page.<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.downgrade.1'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.1 MDT fails over to NEW.0 MDT<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, MDT starts recovery, accepting connections in OLD protocol from OST's and in NEW protocol from clients.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.downgrade.1.ost'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.1 MDT fails over to NEW.0 MDT: ost reconnection and recovery<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| OST<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, OST reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, OST switches to NEW.0 mode, clears OBD_CONNECT_FID bit in its supported connect flags mask, and starts replay, converting RPCs to the OLD protocol.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| Search for "KABOOM" on this page.<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.lookup.old'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by OLD.x server.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || access to existing data and meta-data<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| Given a fid of parent directory, server translates it into inode number (either by doing igif->ino computation, or using /oi index), loads directory inode and looks given name up. If name is found (-ENOENT otherwise), MDT loads inode and checks for "FID" EA. Assuming EA does not exist (see next QAS otherwise), server learns that inode was created by OLD.x server, generates igif fid from (inode number, inode generation) pair, and sends this fid to client as lookup result.<br />
|-align="left"<br />
|'''Response measure:'''|| consistent lookup result that can later be used to access file<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.lookup.new'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by NEW.0 server.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || access to newly created data and meta-data<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| Given a fid of parent directory, server translates it into inode number (either by doing igif->ino computation, or using /oi index), loads directory inode and looks given name up. If name is found (-ENOENT otherwise), MDT loads inode and checks for "FID" EA. Assuming EA exists (see previous QAS otherwise), server learns that inode was created by NEW.0 server, interprets EA contents as a fid, and sends this fid to client as lookup result.<br />
|-align="left"<br />
|'''Response measure:'''|| consistent lookup result that can later be used to access file<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| Possible sanity check: once fid was determined, check that /oi maps this fid to the inode number that was found in the directory.<br />
|-<br />
|}<br />
<br />
; '''mdt.create'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles CREATE(fid) RPC, with fid supplied by a client<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || create object that can later be accessed through client supplied fid.<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| If fid equals the special fill-in-fid constant, MDT generates new fid from an internal fid sequence. New inode is created. "FID" EA is allocated for this inode and filled with the fid. New (inode-number, inode-generation) record is inserted into /oi index with the fid as a key.<br />
|-align="left"<br />
|'''Response measure:'''|| new object created, and can be accessed by fid later.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
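The fid selection step of the '''mdt.create''' scenario can be sketched roughly as follows. This is an illustration only, not the Lustre implementation: the <code>sketch_*</code> names, the fill-in-fid value and the allocator structure are all hypothetical, and the inode creation, "FID" EA and /oi insertion steps are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same field widths as the lu_fid row of the FID format table. */
struct sketch_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* Hypothetical fill-in-fid: a value no client can legitimately allocate. */
#define SKETCH_FILL_IN_SEQ 0x2000003ffULL
#define SKETCH_FILL_IN_OID 0U

/* Hypothetical server-side allocator working from one reserved sequence. */
struct sketch_seq_alloc {
        uint64_t seq;      /* sequence reserved for server-generated fids */
        uint32_t next_oid; /* next unused object id in that sequence */
};

static inline bool sketch_fid_is_fill_in(const struct sketch_fid *fid)
{
        return fid->f_seq == SKETCH_FILL_IN_SEQ &&
               fid->f_oid == SKETCH_FILL_IN_OID;
}

/*
 * Return the fid the new object will be created with: the client's fid,
 * unless the client sent the fill-in-fid, in which case the server
 * allocates one from its own sequence.
 */
static inline struct sketch_fid
sketch_create_pick_fid(struct sketch_seq_alloc *alloc,
                       const struct sketch_fid *client_fid)
{
        struct sketch_fid fid = *client_fid;

        if (sketch_fid_is_fill_in(client_fid)) {
                fid.f_seq = alloc->seq;
                fid.f_oid = alloc->next_oid++;
                fid.f_ver = 0;
        }
        return fid;
}
```

The caller would return the chosen fid in the CREATE reply, which is why a replaying client has to accept a fid different from the one it sent.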
<br />
; '''mdt.readdir'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles READPAGE(parent-fid, offset) RPC<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || return a page filled with NEW protocol directory entries, provide access to both new and old objects through readdir.<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| Using dt-index iterators interface (internally based on ldiskfs_readdir()), MDT iterates over directory entries, and places file names and their hashes into directory entries. For every entry corresponding inode is loaded into memory. If inode contains "FID" EA, its contents are used as a fid, and placed into readdir page. Otherwise, igif fid is generated, and placed into readdir page.<br />
|-align="left"<br />
|'''Response measure:'''|| pre-existing objects, created by OLD.x server, are visible through readdir.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
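The lookup and readdir scenarios above share one rule for choosing the fid reported to the client: use the "FID" EA when the inode has one, otherwise build an igif from the inode number and generation. A minimal sketch of that rule, assuming a simplified in-memory inode (the <code>sketch_*</code> types and fields are hypothetical and stand in for the real ldiskfs and MDT structures):

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* Hypothetical, simplified view of an MDT inode. */
struct sketch_inode {
        uint32_t          ino;        /* inode number */
        uint32_t          gen;        /* inode generation */
        bool              has_fid_ea; /* true if created by a NEW.0 server */
        struct sketch_fid fid_ea;     /* contents of the "FID" EA, if present */
};

/* Pick the fid to return for an inode found by LOOKUP or READPAGE. */
static inline struct sketch_fid
sketch_inode_fid(const struct sketch_inode *inode)
{
        struct sketch_fid fid;

        if (inode->has_fid_ea)
                return inode->fid_ea;

        /* No EA: inode predates NEW.0, so report the surrogate igif. */
        fid.f_seq = inode->ino;
        fid.f_oid = inode->gen;
        fid.f_ver = 0;
        return fid;
}
```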
<br />
== Technical Details [not part of architecture, should go into HLD/DLD]==<br />
<br />
Brief outline of features relevant to interoperability and not mentioned above, supported and expected from the releases above:<br />
<br />
=== OLD.x ===<br />
<br />
* OLD.x: client and OST support both OLD and NEW networking protocol. Protocol version is selected at the time of connection to MDT: if MDT supports OBD_CONNECT_FID connect flag, NEW protocol is used, otherwise OLD.<br />
* once OLD.x node (client or OST) has connected to MDT in NEW mode, it ensures that all other connections are in this mode too. OST adds OBD_CONNECT_FID flag to its connection mask.<br />
* when connected in NEW mode, OLD.x client<br />
** uses fids to identify inodes in the cache (for uniformity, it can internally use igifs, generated from ino/gen pairs in the OLD mode too). Inode numbers for stat(2), are generated from fids ['''done for HEAD, being ported to b1_6_cli_reqs'''];<br />
** expects cmd3-style directory pages in readdir with fids in directory entries ['''done'''];<br />
** takes dlm locks in fid name-space ['''done'''];<br />
** participates in cmd3 recovery protocol, more on this below ['''being implemented by Amit'''];<br />
** uses seq and fld services ['''done'''];<br />
* when on a re-connect OLD.x client detects that connection lost OBD_CONNECT_FID flag that it used to have, it evicts itself to get rid of all extra fid-related state.<br />
** No interoperability changes to the MD server code are made in OLD.x release. <br />
* OLD.x OST servers also support both OLD and NEW networking protocol, and depending on the MDS connection flags either use fids or not. In fid-enabled mode, they act much like clients (see above) in their interaction with MDT. To support NEW protocol OST has to generate fids for objects already existing on the storage. Resulting surrogate fids are called idifs (igifs for data, see igif description below). ['''not started yet''']<br />
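The protocol selection described in this list can be sketched as a small state update run on every connect or reconnect reply. This is an assumption-laden illustration, not Lustre code: the flag value is a placeholder for the real OBD_CONNECT_FID definition, and the <code>sketch_*</code> names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit; the real OBD_CONNECT_FID is defined by the Lustre headers. */
#define SKETCH_CONNECT_FID (1ULL << 30)

enum sketch_proto { SKETCH_PROTO_OLD, SKETCH_PROTO_NEW };

/* Hypothetical per-node state for an OLD.x client or OST. */
struct sketch_node {
        uint64_t          supported_flags; /* flags offered to our own peers */
        enum sketch_proto proto;           /* protocol spoken with the MDT */
        bool              connected;       /* false before the first reply */
        bool              lost_fid;        /* MDT was downgraded under us */
};

/* Apply the connect flags granted by the MDT in a (re)connect reply. */
static inline void
sketch_handle_connect_reply(struct sketch_node *node, uint64_t granted)
{
        bool had_fid = node->connected && node->proto == SKETCH_PROTO_NEW;

        if (granted & SKETCH_CONNECT_FID) {
                node->proto = SKETCH_PROTO_NEW;
                node->supported_flags |= SKETCH_CONNECT_FID;
        } else {
                node->proto = SKETCH_PROTO_OLD;
                node->supported_flags &= ~SKETCH_CONNECT_FID;
                /* Losing a flag we used to have means a downgrade. */
                if (had_fid)
                        node->lost_fid = true;
        }
        node->connected = true;
}
```

What a node does once <code>lost_fid</code> is set (convert pending RPCs, or evict itself) is a policy decision left outside this sketch.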
<br />
=== NEW.0 ===<br />
<br />
This release introduces MDT server speaking NEW protocol only, and running over OLD-format storage. OST server speaking NEW protocol was introduced in the previous OLD.x release. Support for old protocol is completely eliminated in this release.<br />
<br />
To talk in new protocol server has to use FIDs to identify objects, so NEW.0 MDT generates ''surrogate'' FIDs for existing inodes. Such a surrogate FID is referred to as an ''IGIF'' (inode-generation FID), because it is built from inode number and inode generation. Similarly, NEW.0 OST generates surrogate FIDs for existing id/group objects. Format of IGIF and IDIF is described in the table below:<br />
<br />
{| border=1 cellspacing=0 cellpadding="5"<br />
|fields ||SEQ ||OID ||VER<br />
|-<br />
|FID_SEQ_OST_MDT0 ||= 0 || ||<br />
|-<br />
|FID_SEQ_LLOG ||= 1 || ||<br />
|-<br />
|FID_SEQ_ECHO ||= 2 || ||<br />
|-<br />
|FID_SEQ_OST_MDT1 ||= 3 || ||<br />
|-<br />
|FID_SEQ_OST_MAX ||= 9 (=FID_SEQ_OST_MDT7) || ||<br />
|-<br />
|FID_SEQ_IGIF ||= 12 || ||<br />
|-<br />
|FID_SEQ_IGIF_MAX ||= 0xffffffff || ||<br />
|-<br />
|FID_SEQ_IDIF ||=0x100000000 || ||<br />
|-<br />
|FID_SEQ_IDIF_MAX ||=0x1ffffffff || ||<br />
|-<br />
|FID_SEQ_LOCAL_FILE||=0x200000001 || ||<br />
|-<br />
|FID_SEQ_DOT_LUSTRE||=0x200000002 || ||<br />
|-<br />
|FID_SEQ_NORMAL ||=0x200000400 || ||<br />
|-<br />
|-<br />
|obdo/lmm/oinfo(OLD)||o_seq:64 [FID_SEQ_OST_MDT0] ||o_id_lo:48||o_id_hi:16<br />
|-<br />
|obdo/lmm/oinfo(NEW.1)||o_seq:64 [FID_SEQ_{IDIF,NORMAL}]||o_id_lo:32||o_id_hi:32<br />
|-<br />
|lu_fid ||f_seq:64 ||f_oid:32 ||f_ver:32<br />
|-<br />
|IGIF ||0:32, ino:32 [12,FID_SEQ_IGIF_MAX] ||gen:32 ||0:32<br />
|-<br />
|IDIF ||0:31, 1:1, ost_idx:16,o_id_hi:16 ||o_id_lo:32||o_id_hi_hi:16<br />
|-<br />
|reserved ||[FID_SEQ_START,FID_SEQ_START+0x3ff]||f_oid:32 ||f_ver:32<br />
|-<br />
|FID ||[FID_SEQ_NORMAL,2<sup>64</sup>-1] ||f_oid:32 ||f_ver:32<br />
|}<br />
<br />
Legend:<br />
; '''FID''' : File IDentifier generated by client from range allocated by the seq service. First 0x400 sequences [2<sup>33</sup>, 2<sup>33</sup> + 0x3ff] are reserved for system use. Note that on ldiskfs MDTs IGIF FIDs can use inode numbers starting at 12, but this is in the IGIF SEQ range and does not conflict with assigned FIDs.<br />
<br />
; '''IGIF''' : Inode and Generation In FID, a surrogate FID used to globally identify an existing object on OLD formatted MDT file system. This would only be used on MDT0 in a DNE filesystem, because there are not expected to be any OLD formatted DNE filesystems. Belongs to a sequence in [12, 2<sup>32</sup> - 1] range, where sequence number is inode number, and inode generation is used as OID. '''NOTE''': This assumes no more than 2<sup>32</sup>-1 inodes exist in the MDT filesystem, which is the maximum possible for an ldiskfs backend. '''NOTE''': This assumes that the reserved ext3/ext4/ldiskfs inode numbers [0-11] are never visible to clients, which has always been true.<br />
<br />
; '''IDIF''' : object ID in FID, a surrogate FID used to globally identify an existing object on OLD formatted OST file system. Belongs to a sequence in [2<sup>32</sup>, 2<sup>33</sup> - 1]. Sequence number is calculated as:<br />
<pre><br />
1 << 32 | (ost_index << 16) | ((objid >> 32) & 0xffff)<br />
</pre><br />
; ''' ''' : that is, SEQ consists of 16-bit OST index, and higher 16 bits of object ID. The generation of unique SEQ values per OST allows the IDIF FIDs to be identified in the FLD correctly. The OID field is calculated as:<br />
<pre><br />
objid & 0xffffffff<br />
</pre><br />
; ''' ''' : that is, it consists of lower 32 bits of object ID. '''NOTE''' This assumes that no more than 2<sup>48</sup>-1 objects have ever been created on an OST, and that no more than 65535 OSTs are in use. Both are very reasonable assumptions (can uniquely map all objects on an OST that created 1M objects per second for 9 years, or combinations thereof).<br />
<br />
; '''OST_MDT0''' : Surrogate FID used to identify an existing object on OLD formatted OST filesystem. Belongs to the reserved sequence 0, and is used internally prior to the introduction of FID-on-OST, at which point IDIF will be used to identify objects as residing on a specific OST.<br />
<br />
; '''LLOG''' : for Lustre Log objects the object sequence 1 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.<br />
<br />
; '''ECHO''' : for testing OST IO performance the object sequence 2 is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.<br />
<br />
; '''OST_MDT1''' .. '''OST_MAX''' : for testing with multiple MDTs the object sequence 3 through 9 is used, allowing direct mapping of MDTs 1 through 7 respectively, for a total of 8 MDTs including '''OST_MDT0'''. This matches the legacy CMD project "group" mappings. However, this SEQ range is only for testing prior to any production DNE release, as the objects in this range conflict across all OSTs, as the OST index is not part of the FID.<br />
<br />
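As an illustration of the IGIF definition in the legend, the following sketch classifies a FID by its sequence range and converts an IGIF back to an (inode number, generation) pair. The range constants are taken from the table above; the <code>sketch_*</code> names are hypothetical and do not claim to match the Lustre source.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* IGIF sequence range from the table: [12, 0xffffffff]. */
#define SKETCH_SEQ_IGIF     12ULL
#define SKETCH_SEQ_IGIF_MAX 0xffffffffULL

static inline bool sketch_fid_is_igif(const struct sketch_fid *fid)
{
        return fid->f_seq >= SKETCH_SEQ_IGIF &&
               fid->f_seq <= SKETCH_SEQ_IGIF_MAX;
}

/*
 * Reverse of IGIF generation: recover inode number and generation.
 * Returns false for any FID outside the IGIF range, for example a
 * client-generated FID, which has no OLD-format equivalent.
 */
static inline bool
sketch_igif_to_ino(const struct sketch_fid *fid, uint32_t *ino, uint32_t *gen)
{
        if (!sketch_fid_is_igif(fid))
                return false;

        *ino = (uint32_t)fid->f_seq; /* sequence number is the inode number */
        *gen = fid->f_oid;           /* OID is the inode generation */
        return true;
}
```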
<br />
For compatibility with existing OLD OST network protocol structures, the FID must map onto the o_id and o_gr in a manner that ensures existing objects are identified consistently for IO, as well as onto the lock namespace to ensure both IDIFs map onto the same objects for IO as well as resources in the DLM.<br />
<br />
DLM OLD OBIF/IDIF:<br />
resource[] = {o_id, o_seq, 0, 0}; /* o_seq == 0 for production releases */<br />
<br />
DLM NEW.1 FID (this is the same for both the MDT and OST):<br />
resource[] = {SEQ, OID, VER, HASH};<br />
<br />
Note that for mapping IDIF values to DLM resource names the o_id may be larger than the 2<sup>33</sup> reserved sequence numbers for IDIF, so it is possible for the o_id numbers to overlap FID SEQ numbers in the resource. However, in all production releases the OLD o_seq field is always zero, and all valid FID OID values are non-zero, so the lock resources will not collide.<br />
<br />
For objects within the IDIF range, group extraction (non-CMD) will be:<br />
 o_id = (fid->f_seq & 0xffff) << 32 | fid->f_oid;<br />
o_seq = 0; /* formerly group number */<br />
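The IDIF SEQ and OID formulas from the legend, together with their inverse, can be sketched as below. This assumes object IDs below 2<sup>48</sup> and leaves the VER field at zero; the <code>sketch_*</code> names are hypothetical.

```c
#include <stdint.h>

struct sketch_fid {
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* First IDIF sequence: bit 32 set, as in "1 << 32" in the SEQ formula. */
#define SKETCH_SEQ_IDIF (1ULL << 32)

/* Pack an OST index and a 48-bit object ID into an IDIF. */
static inline struct sketch_fid
sketch_idif_build(uint16_t ost_idx, uint64_t objid)
{
        struct sketch_fid fid;

        fid.f_seq = SKETCH_SEQ_IDIF | ((uint64_t)ost_idx << 16) |
                    ((objid >> 32) & 0xffff);
        fid.f_oid = (uint32_t)(objid & 0xffffffff);
        fid.f_ver = 0;
        return fid;
}

/* Recover the object ID: low 16 bits of SEQ are bits 32..47 of the ID. */
static inline uint64_t sketch_idif_objid(const struct sketch_fid *fid)
{
        return ((fid->f_seq & 0xffff) << 32) | fid->f_oid;
}

/* Recover the OST index stored in bits 16..31 of SEQ. */
static inline uint16_t sketch_idif_ost_idx(const struct sketch_fid *fid)
{
        return (uint16_t)((fid->f_seq >> 16) & 0xffff);
}
```

For example, OST index 3 and object ID 0x123456789abc give SEQ 0x100031234 and OID 0x56789abc, and both values are recovered unchanged by the two accessor functions.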
<br />
=== Recovery ===<br />
<br />
There are 2 important recovery scenarios related to interoperability:<br />
<br />
* OLD.x client reconnects to MDT after a fail-over and learns that it has to switch back to the OLD protocol, because server was downgraded. Client has to replay requests, but before that they have to be converted into OLD protocol format. This requires changing message format and going from client-assigned FIDs to inode/generation numbers (storage cookies). If a FID is in IGIF format it can be converted to inode number according to the reverse of IGIF generation algorithm. If a FID is client-generated, then '''*KABOOM*'''! Client has to evict itself, because it doesn't know old-format inode number. '''Q? Is there a better solution?'''. What to do with RPCs that old server cannot handle at all: SEQ_QUERY? Again, eviction seems to be the only option.<br />
<br />
* OLD.x client reconnects to MDT and determines that it has to switch to the new protocol, because MDT was upgraded to NEW.0. To replay RPCs, client has to convert them to the NEW format. This includes message format conversion and going from inode/generation numbers to FIDs. For RPCs that already include inode number as an argument, IGIF FID can be used. For CREATE RPC that requires fid in NEW protocol there are two options:<br />
** client supplies fill-in-FID. NEW.0 server recognizes this as a request to generate FID on the server, and uses special sequence range reserved for this purpose to allocate a FID from. Note that this sequence cannot be exhausted, as there is single MDT in the cluster at that point, which means it has full control over complete FID space.<br />
** client supplies inode number as in usual OLD protocol replay. Server detects this and creates inode with given inode number. This has certain drawbacks:<br />
*** a dependency on ext3-wantedi patch is re-introduced, and<br />
*** backward-compatibility code is introduced in NEW.0 release, which we are trying to avoid.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=12526Coding Guidelines2012-09-04T17:25:41Z<p>Adilger: /* Lustre Guidelines */ add comment block and console error messages</p>
<hr />
<div><small>''(Updated: Jan 2010)''</small><br />
== Beautiful Code == <br />
<br />
''A note from Eric Barton, a Lustre pioneer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''rq_reqmsg'' and ''rq_repmsg'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
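<br />
:The swap example above can be made concrete. The following is a minimal sketch, using hypothetical helper names (''swap_dumb'', ''swap_clever'') that are not part of Lustre, of how the "clever" version hides a trap: if both pointers refer to the same object, the XOR swap zeroes it, while the "dumb" swap leaves it intact.<br />

```c
/* Hypothetical helpers, named for this sketch only. */
static void swap_dumb(int *a, int *b)
{
	int t = *a;

	*a = *b;
	*b = t;
}

static void swap_clever(int *a, int *b)
{
	*a ^= *b;	/* if a == b, this already sets the value to 0 */
	*b ^= *a;
	*a ^= *b;
}
```

:A reader can verify the dumb swap at a glance; the clever one needs an extra, unwritten precondition (''a != b'') to be correct.<br />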
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parenthesis, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* Tabs should be used in all 2.3 and later lustre/, lnet/ and libcfs/ files. This matches the upstream Linux kernel coding style, and is the default method of code indentation.<br />
<br />
* '''NOTE NOTE NOTE''' The use of tabs for indentation is a reversal from previous Lustre coding guidelines, since May 2012 and Lustre 2.3. This is being done in order to facilitate code integration with the Linux kernel. All new patches should be submitted using tabs for ALL modified lines in the patch. If there are 6 or fewer lines using spaces for indentation between two lines changed by a patch, then all of the intervening lines should also have the indentation changed to use tabs. Similarly, if there are only a handful of lines at the start or end of a modified function or test that are still using spaces for indentation, convert all of the lines in that function or test to use tabs. In this manner, we can migrate consistent chunks of code over to tabs without having a 250kLOC patch breaking the commit history of every line of code, and also avoid breaking code that is in existing branches/patches that still need to merge. In a year or so, lines that still use spaces (i.e. those that are not under active development) will be converted to using tabs for indentation as well.<br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
        do_sth2_1;<br />
<br />
        if (cond3)<br />
                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
        do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
        if (!cond1)<br />
                return;<br />
<br />
        do_sth1_1;<br />
<br />
        if (cond2)<br />
                func_helper(...);<br />
<br />
        do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
        if (cond1) {<br />
                do_sth1_1;<br />
                if (cond2) {<br />
                        do_sth2_1;<br />
                        if (cond3) {<br />
                                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
                        }<br />
                        do_sth2_2;<br />
                }<br />
                do_sth1_2;<br />
        }<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regexps:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
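<br />
:As a minimal sketch of the ''typedef'' rules above (the ''demo_'' names are hypothetical and exist only for illustration), the type name carries the ''_t'' suffix and the pointer stays visible in the function signature:<br />

```c
/* Hypothetical structure, for illustration only; not taken from Lustre. */
struct demo_extent {
	unsigned long long de_start;
	unsigned long long de_end;
};

/* The typedef name ends in _t so it is recognizable as a typedef. */
typedef struct demo_extent demo_extent_t;

/* The '*' is written out; there is no "demo_extent_ptr_t" hiding it. */
static unsigned long long demo_extent_len(const demo_extent_t *ext)
{
	return ext->de_end - ext->de_start;
}
```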
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
        ptr = malloc(size);<br />
        if (ptr != NULL) {<br />
                ...<br />
<br />
wrong:<br />
        if ((ptr = malloc(size)) != NULL) {<br />
                ...<br />
</pre><br />
<br />
* Conditional expressions read more clearly if only boolean expressions are implicit (i.e., non-boolean and pointer expressions compare explicitly with ''0'' and ''NULL'' respectively.)<br />
<pre><br />
right:<br />
        if (!writing &&         /* not writing? */<br />
            inode != NULL &&    /* valid inode? */<br />
            ref_count == 0)     /* no more references? */<br />
                do_this();<br />
<br />
wrong:<br />
        if (writing == 0 &&     /* not writing? */<br />
            inode &&            /* valid inode? */<br />
            !ref_count)         /* no more references? */<br />
                do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
        if (a->a_field == 3 ||<br />
            ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
                do_this();<br />
<br />
wrong:<br />
        if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
                do_this();<br />
<br />
wrong:<br />
        if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
           (c->c_field & (BITMASK2)))))<br />
                do_this();<br />
</pre><br />
<br />
=== Lustre Guidelines ===<br />
* The types and printf()/printk() formats used by Lustre code are:<br />
<br />
<pre><br />
__u64                 LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
__u32/int             %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long  %llu/%llx/%lld<br />
loff_t                %lld after a cast to long long (unfortunately)<br />
</pre><br />
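<br />
:The cast rule in the table above can be sketched in plain C. This is only an illustration under stated assumptions: ''demo_format()'' is a hypothetical helper, ''off_t'' stands in for the kernel's ''loff_t'', and ''unsigned long long'' stands in for ''__u64'' (in Lustre code the ''LPU64''/''LPX64'' macros would supply the format).<br />

```c
#include <stdio.h>
#include <sys/types.h>	/* off_t */

/* Hypothetical helper: format a file offset and a 64-bit id into buf. */
static int demo_format(char *buf, size_t len, off_t off, unsigned long long id)
{
	/* off_t has no conversion specifier of its own: cast to long long */
	return snprintf(buf, len, "offset %lld id %#llx", (long long)off, id);
}
```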
<br />
* Functions and files should be documented using the [http://en.wikipedia.org/wiki/Doxygen Doxygen] markup style:<br />
<br />
<pre><br />
/**<br />
 * Implements cl_page_operations::cpo_make_ready() method for Linux.<br />
 *<br />
 * This is called to yank a page referred to by \a slice from the transfer<br />
 * cache and to send it out as a part of transfer. This function try-locks<br />
 * the page. If try-lock failed, page is owned by some concurrent IO, and<br />
 * should be skipped (this is bad, but hopefully rare situation, as it usually<br />
 * results in transfer being shorter than possible).<br />
 *<br />
 * \param env   lu environment for large temporary stack variables<br />
 * \param slice per-layer page structure being prepared<br />
 * \retval 0       success, page can be placed into transfer<br />
 * \retval -EAGAIN page is either used by concurrent IO or has been<br />
 *                 truncated. Skip it.<br />
 */<br />
static int vvp_page_make_ready(const struct lu_env *env,<br />
                               const struct cl_page_slice *slice)<br />
{<br />
</pre><br />
<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each'' followed by ''list_entry''<br />
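<br />
:To show why the combined iterator reads better, here is a self-contained sketch. The list macros below are minimal stand-ins written for this example (the real ones live in the kernel's ''linux/list.h''), ''__typeof__'' assumes a GCC-compatible compiler, and ''struct demo_item'' is hypothetical.<br />

```c
#include <stddef.h>	/* offsetof */

/* Minimal stand-ins for the kernel list API, for this sketch only. */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_entry(ptr, type, member) container_of(ptr, type, member)

#define list_for_each(pos, head) \
	for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

#define list_for_each_entry(pos, head, member)				\
	for ((pos) = list_entry((head)->next, __typeof__(*(pos)), member); \
	     &(pos)->member != (head);					\
	     (pos) = list_entry((pos)->member.next, __typeof__(*(pos)), member))

static void list_add_tail(struct list_head *item, struct list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Hypothetical element type. */
struct demo_item {
	int              di_value;
	struct list_head di_link;
};

/* preferred: the cursor is already the containing structure */
static int demo_sum(struct list_head *head)
{
	struct demo_item *item;
	int               sum = 0;

	list_for_each_entry(item, head, di_link)
		sum += item->di_value;

	return sum;
}

/* discouraged: an extra cursor plus an explicit list_entry() every pass */
static int demo_sum_verbose(struct list_head *head)
{
	struct list_head *pos;
	int               sum = 0;

	list_for_each(pos, head) {
		struct demo_item *item;

		item = list_entry(pos, struct demo_item, di_link);
		sum += item->di_value;
	}

	return sum;
}
```

:Both functions compute the same result; the first simply has fewer moving parts for the reader to track.<br />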
<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
        int *array;<br />
<br />
        OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
        OBD_ALLOC(array, 10 * sizeof(int)); /* breaks if array becomes __u64 */<br />
<br />
wrong:<br />
        OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
<br />
</pre><br />
<br />
* When allocating/freeing a single struct, use OBD_ALLOC_PTR() for clarity:<br />
<pre><br />
right:<br />
        OBD_ALLOC_PTR(mds_body);<br />
        OBD_FREE_PTR(mds_body);<br />
<br />
wrong:<br />
        OBD_ALLOC(mds_body, sizeof(*mds_body));<br />
        OBD_FREE(mds_body, sizeof(*mds_body));<br />
</pre><br />
<br />
* Do not embed operations inside assertions. If assertions are disabled for performance reasons this code will not be executed.<br />
<pre><br />
right:<br />
        ptr = strcat(foo, bar);<br />
        LASSERT(ptr != NULL);<br />
<br />
wrong:<br />
        LASSERT(strcat(foo, bar) != NULL);<br />
</pre><br />
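<br />
:The effect of the rule above can be demonstrated with a small sketch. ''DEMO_ASSERT'' is a hypothetical stand-in for an assertion macro in a build where assertions are compiled out (it is not the real ''LASSERT''), and ''demo_register()'' is an invented operation with a side effect.<br />

```c
/*
 * Hypothetical stand-in for an assertion macro in a build where
 * assertions are disabled: the condition is never evaluated.
 */
#define DEMO_ASSERT(cond) ((void)0)

static int demo_calls;

/* Hypothetical operation with a side effect; returns 0 on success. */
static int demo_register(void)
{
	demo_calls++;
	return 0;
}

/* right: the call always runs, only the check disappears */
static int demo_right(void)
{
	int rc;

	rc = demo_register();
	DEMO_ASSERT(rc == 0);

	return rc;
}

/* wrong: the call disappears together with the assertion */
static void demo_wrong(void)
{
	DEMO_ASSERT(demo_register() == 0);
}
```

:After calling both functions once, ''demo_calls'' is 1 rather than 2: the operation embedded in the assertion never ran.<br />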
<br />
* Messages on the console (''CERROR'', ''CWARN'', ''LCONSOLE_*'') should print the OBD device name or filesystem name where the error is hit, since there are usually multiple targets running on a single server. The error messages should also include enough information to make some useful diagnosis of the problem (e.g. FID and/or filename, client NID, etc). Otherwise, there is little value in having printed the error, but then having to try and reproduce the problem to diagnose it:<br />
<pre><br />
right:<br />
        LCONSOLE_INFO("MDS %s: %s now active, resetting orphans\n",<br />
                      obd->obd_name, obd_uuid2str(uuid));<br />
        LCONSOLE_WARN("%s: new disk, initializing\n", obd->obd_name);<br />
        CERROR("%s: unsupported incompat filesystem feature(s) %x\n", obd->obd_name,<br />
<br />
wrong:<br />
        CERROR("Cannot get thandle\n");<br />
        CERROR("NULL bitmap!\n");<br />
        CERROR("invalid event\n");<br />
</pre><br />
<br />
* Error messages that print a numeric error value should print it at the end of the line in a consistent format:<br />
<pre><br />
right:<br />
        CERROR("%s: error invoking upcall %s %s %s: rc = %d\n",<br />
        CERROR("%s: cannot open/create O: rc = %d\n", obd->obd_name, rc);<br />
<br />
wrong:<br />
        CERROR("err %d on param '%s'\n", rc, ptr);<br />
        CERROR("Can't get index (%d)\n", rc);<br />
</pre><br />
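<br />
:A minimal sketch of the message layout described above, using plain ''snprintf()'' instead of the Lustre console macros. ''demo_errmsg()'' and its parameters are hypothetical; the point is only the shape of the string: device name first, numeric error last as ''rc = %d''.<br />

```c
#include <stdio.h>

/*
 * Hypothetical helper: build a console message that names the device
 * first and ends with the numeric error in the form "rc = %d".
 */
static int demo_errmsg(char *buf, size_t len, const char *devname,
		       const char *what, int rc)
{
	return snprintf(buf, len, "%s: %s: rc = %d\n", devname, what, rc);
}
```

:Keeping the error code in one fixed position makes it straightforward to search console logs for a specific failure.<br />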
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
        list_for_each_entry(...) {<br />
<br />
                if (!condition1) {<br />
                        do_sth1;<br />
                        continue;<br />
                }<br />
<br />
                do_sth2_1;<br />
                do_sth2_2;<br />
                ...<br />
                do_sth2_N;<br />
<br />
                if (!condition2)<br />
                        break;<br />
<br />
                do_sth3_1;<br />
                do_sth3_2;<br />
                ...<br />
                do_sth3_N;<br />
        }<br />
wrong:<br />
        list_for_each_entry(...) {<br />
                if (condition1) {<br />
                        do_sth2_1;<br />
                        do_sth2_2;<br />
                        ...<br />
                        do_sth2_N;<br />
                        if (condition2) {<br />
                                do_sth3_1;<br />
                                do_sth3_2;<br />
                                ...<br />
                                do_sth3_N;<br />
                                continue;<br />
                        }<br />
                        break;<br />
                } else {<br />
                        do_sth1;<br />
                }<br />
        }<br />
</pre><br />
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column, preferably with longer declarations at the top.<br />
<pre><br />
right:<br />
        int           len;<br />
        int           count;<br />
        struct inode *inode;<br />
<br />
wrong:<br />
        int len, count;<br />
        struct inode *inode;<br />
</pre><br />
<br />
* Variable declarations should be kept to an internal scope, if practical and reasonable, to simplify understanding of where these variables are used:<br />
<br />
<pre><br />
right:<br />
        int len;<br />
<br />
        if (len > 0) {<br />
                int           count;<br />
                struct inode *inode = iget(foo);<br />
<br />
                count = inode->i_size;<br />
                :<br />
        }<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
right:<br />
        if (foo)<br />
                bar();<br />
wrong:<br />
        if (foo) bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
        variable = do_something_complicated(long_argument, longer_argument,<br />
                                            longest_argument(sub_argument,<br />
                                                             foo_argument),<br />
                                            last_argument);<br />
<br />
        if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
            another_long_condition(very_long_argument_name,<br />
                                   another_long_argument_name) ><br />
            second_long_value) {<br />
                ...<br />
<br />
wrong:<br />
<br />
        variable = do_something_complicated(long_argument, longer_argument,<br />
                longest_argument(sub_argument, foo_argument),<br />
                last_argument);<br />
<br />
        if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
                another_long_condition(very_long_argument_name,<br />
                        another_long_argument_name) ><br />
                second_long_value) {<br />
                ...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
        off = le32_to_cpu(fsd->fsd_client_start) +<br />
                cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
right:<br />
        a++;<br />
        b |= c;<br />
        d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right:<br />
        do_foo(bar, baz);<br />
<br />
wrong:<br />
        do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
right:<br />
        for (a = 0; a < b; a++)<br />
        if (a < b || a == c)<br />
        while (1)<br />
wrong:<br />
        for( a=0; a<b; a++ )<br />
        if( a<b || a==c )<br />
        while( 1 )<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
        if (bar) {<br />
                this();<br />
                that();<br />
        } else if (baz) {<br />
                stuff();<br />
        } else {<br />
                other_stuff();<br />
        }<br />
<br />
        do {<br />
                cow();<br />
        } while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
        if (foo) {<br />
                bar();<br />
                baz();<br />
        } else {<br />
                salmon();<br />
        }<br />
<br />
wrong:<br />
        if (foo) {<br />
                bar();<br />
                baz();<br />
        } else<br />
                moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a)                     \<br />
do {                                    \<br />
        int b = (a) + MAGIC;            \<br />
        do_other_stuff(b);              \<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a)                     \<br />
do {                                    \<br />
        int b = a + MAGIC;              \<br />
        do_other_stuff(b);              \<br />
} while (0);<br />
</pre><br />
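<br />
:The two macro rules above guard against concrete failures, which the following sketch illustrates. All ''DEMO_'' names are hypothetical. An unparenthesized parameter lets operator precedence leak into the expansion, and a trailing semicolon after ''while (0)'' would break the ''if''/''else'' pairing in ''demo_accumulate()''.<br />

```c
#define DEMO_MAGIC 10

/* wrong: the parameter is not parenthesized, so precedence leaks in */
#define DEMO_SCALE_BAD(a)	(a * DEMO_MAGIC)

/* right: every parameter reference is wrapped in parentheses */
#define DEMO_SCALE(a)		((a) * DEMO_MAGIC)

/* right: do/while (0) with no trailing semicolon */
#define DEMO_ADD_TWICE(total, a)	\
do {					\
	(total) += (a);			\
	(total) += (a);			\
} while (0)

static int demo_accumulate(int enabled, int value)
{
	int total = 0;

	/* the macro expands to one statement, so the else still pairs up */
	if (enabled)
		DEMO_ADD_TWICE(total, value);
	else
		total = -1;

	return total;
}
```

:With these definitions, ''DEMO_SCALE_BAD(1 + 2)'' expands to ''(1 + 2 * 10)'' and yields 21, while ''DEMO_SCALE(1 + 2)'' yields the intended 30.<br />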
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
        return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
        return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
        if (invalid_dentry(parent)) {<br />
                ...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
        if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
        if (d_unhashed(parent)) {<br />
#endif<br />
                ...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
 * of<br />
 * stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
 * stuff */<br />
# endif /* HAVE_FEATURE */<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
 * as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
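<br />
:As a rough sketch of what "correctly swabbed" means for a wire structure: every multi-byte field is byte-swapped in place when the sender's byte order differs from the receiver's. The structure and helper names below (''demo_wire_hdr'', ''demo_swab32()'', ''demo_swab64()'') are hypothetical and are not the actual Lustre swab routines.<br />

```c
#include <stdint.h>

/* Hypothetical wire structure; real ones belong in lustre_idl.h. */
struct demo_wire_hdr {
	uint64_t dwh_id;
	uint32_t dwh_flags;
	uint32_t dwh_padding;
};

static uint32_t demo_swab32(uint32_t v)
{
	return ((v & 0x000000ffU) << 24) | ((v & 0x0000ff00U) << 8) |
	       ((v & 0x00ff0000U) >> 8) | ((v & 0xff000000U) >> 24);
}

static uint64_t demo_swab64(uint64_t v)
{
	return ((uint64_t)demo_swab32((uint32_t)v) << 32) |
	       demo_swab32((uint32_t)(v >> 32));
}

/* Swab each field in place; intended only for opposite-endian peers. */
static void demo_wire_hdr_swab(struct demo_wire_hdr *hdr)
{
	hdr->dwh_id = demo_swab64(hdr->dwh_id);
	hdr->dwh_flags = demo_swab32(hdr->dwh_flags);
	hdr->dwh_padding = demo_swab32(hdr->dwh_padding);
}
```

:Keeping the structure and its swab routine in a single place is what makes it practical to update both together when a field is added.<br />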
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64                 LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t                LPSZ (or cast to int and use %u / %d)<br />
__u32/int             %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long  %llu/%llx/%lld<br />
loff_t                %lld after a cast to long long (unfortunately)<br />
</pre><br />
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
                   [ac_cv_emxos2=yes],<br />
                   [ac_cv_emxos2=no])])<br />
</pre><br />
:or even<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
               [ac_cv_emxos2],<br />
               [AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
                                                   [return __EMX__;])],<br />
                                  [ac_cv_emxos2=yes],<br />
                                  [ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=12439Coding Guidelines2012-06-26T23:04:39Z<p>Adilger: /* Whitespace */ change coding guidelines over to tabs</p>
<hr />
<div><small>''(Updated: Jan 2010)''</small><br />
== Beautiful Code == <br />
<br />
''A note from Eric Barton, a Lustre pioneer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''rq_reqmsg'' and ''rq_repmsg'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parenthesis, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* Tabs should be used in all 2.3 and later lustre/, lnet/ and libcfs/ files. This matches the upstream Linux kernel coding style, and is the default method of code indentation.<br />
<br />
* '''NOTE NOTE NOTE''' The use of tabs for indentation is a reversal of previous Lustre coding guidelines, in effect since May 2012 and Lustre 2.3. This is being done in order to facilitate code integration with the Linux kernel. All new patches should be submitted using tabs for ALL modified lines in the patch. If there are 6 or fewer lines using spaces for indentation between two lines changed by a patch, then all of the intervening lines should also have the indentation changed to use tabs. Similarly, if there are only a handful of lines at the start or end of a modified function or test that are still using spaces for indentation, convert all of the lines in that function or test to use tabs. In this manner, we can migrate consistent chunks of code over to tabs without having a 250kLOC patch breaking the commit history of every line of code, and also avoid breaking code that is in existing branches/patches that still need to merge. In a year or so, lines that still use spaces (i.e. those that are not under active development) will be converted to using tabs for indentation as well.<br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
	do_sth2_1;<br />
<br />
	if (cond3)<br />
		do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
	do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
	if (!cond1)<br />
		return;<br />
<br />
	do_sth1_1;<br />
<br />
	if (cond2)<br />
		func_helper(...);<br />
<br />
	do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
	if (cond1) {<br />
		do_sth1_1;<br />
		if (cond2) {<br />
			do_sth2_1;<br />
			if (cond3) {<br />
				do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
			}<br />
			do_sth2_2;<br />
		}<br />
		do_sth1_2;<br />
	}<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regexp:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
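:For scripted checks, the same test as the ''/[ \t]$/'' regexp above can be sketched in a few lines of C. This is a minimal illustration, not an existing Lustre tool; the function name ''has_trailing_ws'' is hypothetical, and the line is assumed to be passed without its terminating newline.<br />

```c
#include <string.h>

/*
 * Hypothetical helper: return 1 if the line ends in a space or a tab,
 * 0 otherwise. The caller is assumed to have stripped the newline.
 */
int has_trailing_ws(const char *line)
{
	size_t len = strlen(line);

	if (len == 0)
		return 0;

	return line[len - 1] == ' ' || line[len - 1] == '\t';
}
```

:Note that a line consisting only of spaces or tabs is also reported, which matches the rule that blank lines must be empty.<br />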
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
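:One concrete way a pointer ''typedef'' obscures the code is its interaction with ''const''. The sketch below uses hypothetical names (''struct demo_item'', ''demo_item_ptr_t'') that do not exist in Lustre. With the hidden pointer, ''const'' applies to the pointer itself rather than to the structure it points at, so the function that looks read-only is free to modify its argument.<br />

```c
#include <stddef.h>

/* Hypothetical structure, for illustration only. */
struct demo_item {
	int di_value;
};

/* Discouraged: the typedef hides the fact that this is a pointer. */
typedef struct demo_item *demo_item_ptr_t;

/* Preferred: the '*' is visible and const protects the structure. */
int demo_item_value(const struct demo_item *item)
{
	return item != NULL ? item->di_value : -1;
}

/* Looks read-only, but only the pointer is const, so this compiles. */
void demo_item_reset(const demo_item_ptr_t item)
{
	item->di_value = 0;
}
```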
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
	ptr = malloc(size);<br />
	if (ptr != NULL) {<br />
		...<br />
<br />
wrong:<br />
	if ((ptr = malloc(size)) != NULL) {<br />
		...<br />
</pre><br />
<br />
* Conditional expressions read more clearly if only boolean expressions are implicit (i.e., non-boolean and pointer expressions compare explicitly with ''0'' and ''NULL'' respectively.)<br />
<pre><br />
right:<br />
	if (!writing &&		/* not writing? */<br />
	    inode != NULL &&	/* valid inode? */<br />
	    ref_count == 0)	/* no more references? */<br />
		do_this();<br />
<br />
wrong:<br />
	if (writing == 0 &&	/* not writing? */<br />
	    inode &&		/* valid inode? */<br />
	    !ref_count)		/* no more references? */<br />
		do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
	if (a->a_field == 3 ||<br />
	    ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
		do_this();<br />
<br />
wrong:<br />
	if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
		do_this();<br />
<br />
wrong:<br />
	if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
	    (c->c_field & (BITMASK2)))))<br />
		do_this();<br />
</pre><br />
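:A classic precedence trap that parentheses avoid: in C, ''=='' binds more tightly than ''&'', so an unparenthesized test such as ''flags & DEMO_FLAG == 0'' does not test the bit at all. The sketch below spells out both readings explicitly; ''DEMO_FLAG'' and the function names are hypothetical and chosen only for this illustration.<br />

```c
/* Hypothetical flag bit, for illustration only. */
#define DEMO_FLAG 0x4

/*
 * What "flags & DEMO_FLAG == 0" really means to the compiler:
 * the comparison is evaluated first, giving flags & 0.
 */
int flag_clear_as_parsed(unsigned int flags)
{
	return flags & (DEMO_FLAG == 0);
}

/* What the author intended: test whether the bit is clear. */
int flag_clear_intended(unsigned int flags)
{
	return (flags & DEMO_FLAG) == 0;
}
```

:The first function returns 0 for every input, which is why the intended grouping should always be written out.<br />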
<br />
=== Lustre Guidelines ===<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each'' followed by ''list_entry''<br />
<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
	int *array;<br />
<br />
	OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
	OBD_ALLOC(array, 10 * sizeof(int)); /* breaks if array becomes __u64 */<br />
<br />
wrong:<br />
	OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
</pre><br />
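:The three cases above can be checked with plain C. The sketch below is an assumption-laden illustration: it uses the standard ''uint64_t'' in place of ''__u64'', computes only the byte counts rather than calling ''OBD_ALLOC()'', and all function names are hypothetical. It models the situation after the array type has been changed from ''int'' to a 64-bit type.<br />

```c
#include <stddef.h>
#include <stdint.h>

/* Right: the size follows the type of the variable automatically. */
size_t demo_size_right(const uint64_t *array, size_t count)
{
	return count * sizeof(*array);
}

/* Wrong: the type name was not updated when the variable changed. */
size_t demo_size_stale_type(size_t count)
{
	return count * sizeof(int);
}

/* Wrong: this is the size of the pointer, not of an element. */
size_t demo_size_pointer(const uint64_t *array, size_t count)
{
	return count * sizeof(array);
}
```

:Since ''sizeof'' does not evaluate its operand, ''sizeof(*array)'' is safe even before the pointer has been assigned.<br />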
<br />
* When allocating/freeing a single struct, use OBD_ALLOC_PTR() for clarity:<br />
<pre><br />
right:<br />
OBD_ALLOC_PTR(mds_body);<br />
OBD_FREE_PTR(mds_body);<br />
<br />
wrong:<br />
OBD_ALLOC(mds_body, sizeof(*mds_body));<br />
OBD_FREE(mds_body, sizeof(*mds_body));<br />
</pre><br />
<br />
* Do not embed operations inside assertions. If assertions are disabled for performance reasons this code will not be executed.<br />
<pre><br />
right:<br />
	len = strlcat(foo, bar, sizeof(foo));<br />
	LASSERT(len > 0);<br />
<br />
wrong:<br />
	LASSERT(strlcat(foo, bar, sizeof(foo)) > 0);<br />
</pre><br />
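:The effect of a disabled assertion can be sketched as follows. ''DEMO_ASSERT'' is a hypothetical stand-in for an assertion macro that has been compiled out; it is not the definition of ''LASSERT()'', and the other names are likewise invented for the illustration. A call counter shows whether the operation actually ran.<br />

```c
/* Hypothetical assertion macro in its "disabled" form. */
#define DEMO_ASSERT(cond) ((void)0)

/* Number of times demo_op() has run since the last reset. */
static int demo_calls;

int demo_op(void)
{
	demo_calls++;
	return 1;
}

/* Wrong: the operation disappears together with the assertion. */
int demo_wrong(void)
{
	demo_calls = 0;
	DEMO_ASSERT(demo_op() > 0);

	return demo_calls;
}

/* Right: the operation always runs; only the check is optional. */
int demo_right(void)
{
	int rc;

	demo_calls = 0;
	rc = demo_op();
	DEMO_ASSERT(rc > 0);

	return rc > 0 ? demo_calls : -1;
}
```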
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
	list_for_each_entry(...) {<br />
		if (!condition1) {<br />
			do_sth1;<br />
			continue;<br />
		}<br />
<br />
		do_sth2_1;<br />
		do_sth2_2;<br />
		...<br />
		do_sth2_N;<br />
<br />
		if (!condition2)<br />
			break;<br />
<br />
		do_sth3_1;<br />
		do_sth3_2;<br />
		...<br />
		do_sth3_N;<br />
	}<br />
<br />
wrong:<br />
	list_for_each_entry(...) {<br />
		if (condition1) {<br />
			do_sth2_1;<br />
			do_sth2_2;<br />
			...<br />
			do_sth2_N;<br />
			if (condition2) {<br />
				do_sth3_1;<br />
				do_sth3_2;<br />
				...<br />
				do_sth3_N;<br />
				continue;<br />
			}<br />
			break;<br />
		} else {<br />
			do_sth1;<br />
		}<br />
	}<br />
</pre><br />
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column, preferably with longer declarations at the top.<br />
<pre><br />
right:<br />
	int		 len;<br />
	int		 count;<br />
	struct inode	*inode;<br />
<br />
wrong:<br />
	int len, count;<br />
	struct inode *inode;<br />
</pre><br />
<br />
* Variable declarations should be kept to the innermost scope, if practical and reasonable, to simplify understanding of where these variables are used:<br />
<br />
<pre><br />
right:<br />
	int len;<br />
<br />
	if (len > 0) {<br />
		int count;<br />
		struct inode *inode = iget(foo);<br />
<br />
		count = inode->i_size;<br />
		:<br />
	}<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
right:<br />
	if (foo)<br />
		bar();<br />
<br />
wrong:<br />
	if (foo) bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
	variable = do_something_complicated(long_argument, longer_argument,<br />
					    longest_argument(sub_argument,<br />
							     foo_argument),<br />
					    last_argument);<br />
<br />
	if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
	    another_long_condition(very_long_argument_name,<br />
				   another_long_argument_name) ><br />
	    second_long_value) {<br />
		...<br />
<br />
wrong:<br />
<br />
	variable = do_something_complicated(long_argument, longer_argument,<br />
		longest_argument(sub_argument, foo_argument),<br />
		last_argument);<br />
<br />
	if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
		another_long_condition(very_long_argument_name,<br />
			another_long_argument_name) ><br />
		second_long_value) {<br />
		...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
	off = le32_to_cpu(fsd->fsd_client_start) +<br />
		cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
right:<br />
a++;<br />
b |= c;<br />
d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right:<br />
	do_foo(bar, baz);<br />
<br />
wrong:<br />
	do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
right:<br />
for (a = 0; a < b; a++)<br />
if (a < b || a == c)<br />
while (1)<br />
wrong:<br />
for( a=0; a<b; a++ )<br />
if( a<b || a==c )<br />
while( 1 )<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
	if (bar) {<br />
		this();<br />
		that();<br />
	} else if (baz) {<br />
		stuff();<br />
	} else {<br />
		other_stuff();<br />
	}<br />
<br />
	do {<br />
		cow();<br />
	} while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
	if (foo) {<br />
		bar();<br />
		baz();<br />
	} else {<br />
		salmon();<br />
	}<br />
<br />
wrong:<br />
	if (foo) {<br />
		bar();<br />
		baz();<br />
	} else<br />
		moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a)				\<br />
do {						\<br />
	int b = (a) + MAGIC;			\<br />
	do_other_stuff(b);			\<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a) \<br />
do { \<br />
	int b = a + MAGIC; \<br />
	do_other_stuff(b); \<br />
} while (0);<br />
</pre><br />
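:Both halves of the macro rule can be demonstrated with a small sketch. The macro and function names below are hypothetical and exist only for this illustration. The unparenthesized macro gives the wrong answer for a compound argument, and the do/while (0) form without a trailing semicolon is what allows the macro to sit in an ''if''/''else'' like an ordinary call.<br />

```c
/* Wrong: DEMO_SQUARE_BAD(1 + 2) expands to 1 + 2 * 1 + 2. */
#define DEMO_SQUARE_BAD(a)	a * a

/* Right: every parameter reference is parenthesized. */
#define DEMO_SQUARE_GOOD(a)	((a) * (a))

/* Multi-statement macro with no trailing semicolon. */
#define DEMO_ADD_TWICE(sum, v)			\
do {						\
	(sum) += (v);				\
	(sum) += (v);				\
} while (0)

/*
 * If the macro ended in "while (0);" the else below would be
 * separated from its if and this would not compile.
 */
int demo_add_if(int enabled, int v)
{
	int sum = 0;

	if (enabled)
		DEMO_ADD_TWICE(sum, v);
	else
		sum = -1;

	return sum;
}
```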
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
	return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
	return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
	if (invalid_dentry(parent)) {<br />
		...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
	if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
	if (d_unhashed(parent)) {<br />
#endif<br />
		...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
of<br />
stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
* stuff */<br />
# endif<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
 * as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64                   LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t                  LPSZ (or cast to int and use %u / %d)<br />
__u32/int               %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long    %llu/%llx/%lld<br />
loff_t                  %lld after a cast to long long (unfortunately)<br />
</pre><br />
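:The "cast to long long" rows of the table can be sketched in portable C as follows. This is an illustration under stated assumptions: it uses the standard ''uint64_t'' and ''int64_t'' types as stand-ins for ''__u64'' and ''loff_t'', uses ''snprintf()'' rather than any Lustre logging macro, and the function names are hypothetical. The explicit cast makes the argument match the format regardless of how the 64-bit type is defined on a given architecture.<br />

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format an unsigned 64-bit value; returns the number of characters. */
int demo_format_u64(char *buf, size_t len, uint64_t val)
{
	return snprintf(buf, len, "%llu", (unsigned long long)val);
}

/* Format a signed 64-bit offset after a cast to long long. */
int demo_format_off(char *buf, size_t len, int64_t off)
{
	return snprintf(buf, len, "%lld", (long long)off);
}
```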
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
                   [ac_cv_emxos2=yes],<br />
                   [ac_cv_emxos2=no])])<br />
</pre><br />
:or even<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
               [ac_cv_emxos2],<br />
               [AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
                                                   [return __EMX__;])],<br />
                                  [ac_cv_emxos2=yes],<br />
                                  [ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=12324Lustre Project List2012-01-10T12:24:27Z<p>Adilger: add mdd-survey link</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128] [http://jira.whamcloud.com/browse/LU-16 LU-16] (finished)<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one must either be replaced with equivalent functionality that already exists in the kernel, or be accepted upstream. See also ldiskfs patch removal.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658] [http://jira.whamcloud.com/browse/LU-633 LU-633] (finished)<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217] [http://jira.whamcloud.com/browse/LU-18 LU-18] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Allow default OST pool<br />
|4<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24314 24314]<br />
|<small>Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, and index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833] [http://jira.whamcloud.com/browse/LU-5 LU-5] (finished)<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. It is expected to be significant over WAN high-latency links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268] [http://jira.whamcloud.com/browse/LU-80 LU-80] (finished)<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063] [http://jira.whamcloud.com/browse/LU-136 LU-136] (finished)<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely, so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526] [http://jira.whamcloud.com/browse/ORNL-6 ORNL-6] (finished)<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767] [http://jira.whamcloud.com/browse/LU-19 LU-19] (finished)<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to timeout the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. quotas that limit or deny access to certain pools for specific users, including the default "all OSTs" pool) and more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634] [http://jira.whamcloud.com/browse/LU-938 LU-938] (work in progress)<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and policy engine for automatic space management. An ioctl that allows transparently changing an MDS inode to point to the migrated object(s) instead of the original object(s) and then scheduling the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=12323Lustre Project List2012-01-09T23:22:28Z<p>Adilger: /* List of Lustre Features and Projects */ update status of some finished projects</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128] [http://jira.whamcloud.com/browse/LU-16 LU-16] (finished)<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one must either be replaced by equivalent functionality that already exists in the kernel, or be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217] [http://jira.whamcloud.com/browse/LU-18 LU-18] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Allow default OST pool<br />
|4<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24314 24314]<br />
|<small>Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem, if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833] [http://jira.whamcloud.com/browse/LU-5 LU-5] (finished)<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268] [http://jira.whamcloud.com/browse/LU-80 LU-80] (finished)<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063] [http://jira.whamcloud.com/browse/LU-136 LU-136] (finished)<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely, so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526] [http://jira.whamcloud.com/browse/ORNL-6 ORNL-6] (finished)<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. The preferred solution is readdir+ or SOM, but this could help in the short term and would still be useful for open files. It does not affect the network protocol, so it could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767] [http://jira.whamcloud.com/browse/LU-19 LU-19] (finished)<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. quotas to limit (or deny) access to certain pools by specific users, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634] [http://jira.whamcloud.com/browse/LU-938 LU-938] (work in progress)<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add a new OST and balance data onto it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and a policy engine for automatic space management. This needs an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (as would happen if network decryption and on-disk encryption were done separately).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=12296Lustre Project List2011-05-21T20:09:47Z<p>Adilger: /* List of Lustre Features and Projects */</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128] [http://jira.whamcloud.com/browse/LU-16 LU-16] (work in progress)<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one must either be replaced by equivalent functionality that already exists in the kernel, or be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217] [http://jira.whamcloud.com/browse/LU-18 LU-18] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Allow default OST pool<br />
|4<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24314 24314]<br />
|<small>Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem, if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833] [http://jira.whamcloud.com/browse/LU-5 LU-5] (work in progress)<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268]<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063] [http://jira.whamcloud.com/browse/LU-136 LU-136] (work in progress)<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely, so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526] [http://jira.whamcloud.com/browse/ORNL-6 ORNL-6] (work in progress)<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. The preferred solution is readdir+ or SOM, but this could help in the short term and would still be useful for open files. It does not affect the network protocol, so it could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767] [http://jira.whamcloud.com/browse/LU-19 LU-19] (work in progress)<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. ACLs to only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add a new OST and balance data onto it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and a policy engine for automatic space management. This needs an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=12295Lustre Project List2011-05-21T20:04:00Z<p>Adilger: /* List of Lustre Features and Projects */</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128]<br />
[http://jira.whamcloud.com/browse/LU-16 LU-16] (work in progress)<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one needs either to be replaced by equivalent functionality that already exists in the kernel, or to be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217]<br />
[http://jira.whamcloud.com/browse/LU-18 LU-18] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Allow default OST pool<br />
|4<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24314 24314]<br />
|<small>Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem, if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
[http://jira.whamcloud.com/browse/LU-5 LU-5] (work in progress)<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268]<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
[http://jira.whamcloud.com/browse/LU-136 LU-136] work in progress<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement a distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526] work in progress<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
[http://jira.whamcloud.com/browse/LU-19 LU-19] work in progress<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. ACLs to only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and policy engine for automatic space management. An ioctl that allows transparently changing an MDS inode to point to the migrated object(s) instead of the original object(s) and then scheduling the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=12294Lustre Project List2011-05-21T20:02:52Z<p>Adilger: </p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128]<br />
[http://jira.whamcloud.com/browse/LU-16 LU-16] (work in progress)<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one needs either to be replaced by equivalent functionality that already exists in the kernel, or to be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217]<br />
[http://jira.whamcloud.com/browse/LU-18 LU-18] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Allow default OST pool<br />
|4<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24314 24314]<br />
|<small>Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem, if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
[http://jira.whamcloud.com/browse/LU-5 LU-5] (work in progress)<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268]<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
[http://jira.whamcloud.com/browse/LU-136 LU-136] work in progress<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement a distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526] work in progress<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
[http://jira.whamcloud.com/browse/LU-19 LU-19] work in progress<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. ACLs to only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add a new OST and balance data onto it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and a policy engine for automatic space management. Needs an ioctl that allows transparently changing an MDS inode to point to the migrated object(s) instead of the original object(s) and then scheduling the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11996Lustre Project List2010-12-10T07:28:31Z<p>Adilger: /* List of Lustre Features and Projects */ add default OST pools</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128] (work in progress)<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one must either be replaced with equivalent functionality that already exists in the kernel, or be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Allow default OST pool<br />
|4<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24314 24314]<br />
|<small>Allow a filesystem-wide default OST pool to be specified. Currently, it is possible to set the default stripe count, size, and index on a filesystem with "lfs setstripe" on the filesystem root, but the OST pool name is ignored. There is no other mechanism to specify the default OST pool for all new files in the filesystem if no pool is specified. This would be useful for WAN or other heterogeneous OST configurations.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833] (work in progress)<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268]<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. ACLs to only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add a new OST and balance data onto it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and a policy engine for automatic space management. Needs an ioctl that allows transparently changing an MDS inode to point to the migrated object(s) instead of the original object(s) and then scheduling the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11995Lustre Project List2010-12-07T22:23:14Z<p>Adilger: /* List of Lustre Features and Projects */ add WIP status to a few projects</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128] (work in progress)<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one must either be replaced with equivalent functionality that already exists in the kernel, or be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24217 24217] (work in progress)<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833] (work in progress)<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268]<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128] (work in progress)<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. ACLs to only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and policy engine for automatic space management. Also needed is an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11900Lustre Project List2010-12-03T10:24:00Z<p>Adilger: add large EA handling</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128]<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one must either be replaced with equivalent functionality that already exists in the kernel, or be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter-survey tool (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Finish large EA handling for ldiskfs<br />
|4<br />
|ldiskfs<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24268 24268]<br />
|<small>Finish off the large EA handling in ldiskfs, and get this code accepted upstream.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128]<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term and would still be useful for open files. It does not affect the network protocol, so it could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. ACLs to only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID) and file parameters (name, extension, etc).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and policy engine for automatic space management. Also needed is an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11878Lustre Project List2010-11-10T10:04:52Z<p>Adilger: /* List of Lustre Features and Projects */ add OST replacement</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128]<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one must either be replaced with equivalent functionality that already exists in the kernel, or be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter-survey tool (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Online OST replacement<br />
|4<br />
|OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=24128 24128]<br />
|<small>Allow a new OST to replace a previous OST at the same index, in case of hardware replacement or unrecoverable filesystem corruption.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking the whole filesystem while a consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term and would still be useful for open files. It does not affect the network protocol, so it could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and a policy engine for automatic space management. This needs an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11866Lustre Project List2010-09-14T06:09:56Z<p>Adilger: /* List of Lustre Features and Projects */ add bug for testing efficiency</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128]<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=23051 23051]<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling, dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
|MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one needs either to be replaced by equivalent functionality that already exists in the kernel, or to be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking whole filesystem while consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc. in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and a policy engine for automatic space management. This needs an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11863Lustre Project List2010-09-09T23:16:56Z<p>Adilger: /* List of Lustre Features and Projects */</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 2TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128]<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling, dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
|MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one needs either to be replaced by equivalent functionality that already exists in the kernel, or to be accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. The improvement is expected to be significant over high-latency WAN links.</small><br />
|-<br />
|Over 16TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking whole filesystem while consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc. in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and a policy engine for automatic space management. This needs an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (in case of separate decryption from network, encryption on disk).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11862Lustre Project List2010-09-09T22:29:11Z<p>Adilger: /* List of Lustre Features and Projects */ add remaining features</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Over 4TB objects<br />
|3<br />
|RPC, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20128 20128]<br />
|<small>Support objects larger than 2TB in size. Currently the client assumes that the largest possible object size is 2TB, but this limit should be returned from the OST at connect time.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|kernel patch removal<br />
|3<br />
|MDS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21524 21524]<br />
|<small>Remove Lustre kernel patches to allow Lustre servers to be more easily ported to new kernels, and to be built against vendor kernels without changing the vendor kernel RPMs. There are a number of different patches; each one either needs to be replaced with equivalent functionality that already exists in the kernel, or needs work to get it accepted upstream. See also ldiskfs patch cleanup.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|fallocate() API<br />
|3<br />
|VFS, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15064 15064]<br />
|<small>Add client interface and RPC to allow space reservation for objects on OSTs; sys_fallocate() exists on clients since RHEL5.4 and in ext4-based ldiskfs.</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Large Readdir RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. It is expected to be significant over WAN high-latency links.</small><br />
|-<br />
|32TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking whole filesystem while consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|ldiskfs patch cleanup<br />
|5<br />
|ext4, OST, MDT<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21635 21635]<br />
|<small>A number of the ldiskfs patches should be cleaned up, or possibly removed entirely, so that ongoing patch updates against new kernels are simplified.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available. Could potentially be extended to do object readahead instead of simply a size glimpse if a lookup-stat-read pattern was detected.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Client PAGE_SIZE < server PAGE_SIZE<br />
|6<br />
|RPC, LNET<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=686 686]<br />
|<small>Support smaller page sizes on client than server. Applies to exotic server HW like PPC/ia64/SPARC.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|Directory readdir+<br />
|7<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17845 17845]<br />
|<small>Bulk metadata readdir/stat interface to speed up "ls -l" operations. Send back requested inode attributes for all directory entries as part of the extended dirent data. Integrate with any proposed API for this on the client. Needs Large Readdir RPCs to be efficient over the wire, since more data will be returned for every entry.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and policy engine for automatic space management. This requires an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|-<br />
|Small file IO aggregation<br />
|7<br />
|CLIO, OST<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=944 944]<br />
|<small>Small file IO aggregation (multi-object RPCs), most likely for writes first, and possibly later for reads in conjunction with statahead.</small><br />
|-<br />
|Version Based Recovery for delayed clients<br />
|8<br />
|recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=10609 10609]<br />
|<small>Complete VBR implementation to handle delayed client recovery/reconnection. Needed for disconnected network operation, better fault tolerance.</small><br />
|-<br />
|Client-side data encryption<br />
|9<br />
|VM, security<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Encrypt files and directories (or possibly just filenames) on the client before sending to the server. This avoids sending unencrypted data over the network, or ever having the data in plaintext on the server (as would be the case if network encryption and on-disk encryption were handled separately).</small><br />
|-<br />
|Ptlrpc layer rewrite<br />
|9<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5286 5286]<br />
|<small>Rewrite the Lustre RPC code to clean up the code and simplify RPC handling.</small><br />
|-<br />
|local object zero-copy IO<br />
|9<br />
|VFS, DLM, OST<br />
|<br />
|<small>Efficient data IO between a client and a local OST object; optimization to support local clients. Likely implemented as a fast-path connection between the OSC and the local OFD/OSD. Read cache should be kept on the OSD instead of at the client VFS level, so that the cache can be shared among all users of this OST.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11861Lustre Project List2010-09-09T21:36:20Z<p>Adilger: /* List of Lustre Features and Projects */ add more features</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. As a basic step, printing the duration of each test script in the acceptance-small.sh test summary would tell us where the testing time is being spent.<br />
<br />
More advanced work includes improved test scheduling and dynamic cluster configuration to allow more efficient utilization of available test nodes. Virtual machines could be used for functional tests instead of real nodes.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done. Being able to dump the config log to a plain text file, edit it, and then restore it would make administration considerably easier.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|Allow 100k open files on a single client<br />
|4<br />
|client, MDS<br />
|<br />
|<small>Allow 100k open files per client. Fix client to not store committed open RPCs in the resend list but instead reopen files from the file handles upon recovery (see Simplified Interop) to avoid O(n) behaviour when adding new RPCs to the RPCs-for-recovery list on the client. Fix MDS to store "mfd" in a hash table instead of a linked list to avoid O(n) behaviour when searching for an open file handle. For debugging it would be useful to have a /proc entry on the MDS showing the open FIDs for each client export.</small><br />
|-<br />
|Error message improvements<br />
|4<br />
|core, operations<br />
|<br />
|<small>Review and improve the Lustre error messages to be more useful. A larger project is to change the core Lustre error message handling to generate better structured error messages so that they can be parsed/managed more easily.</small><br />
|-<br />
|Client under memory pressure<br />
|4<br />
|client, VFS, MM<br />
|<br />
|<small>Fix client to work well under memory pressure, to avoid deadlocks during allocation and be able to continue processing RPCs, reduce caches, free memory. This is a prerequisite for swap-on-Lustre.</small><br />
|-<br />
|Readdir with large read RPCs<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. It is expected to be significant over WAN high-latency links.</small><br />
|-<br />
|32TB ldiskfs filesystems<br />
|4<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|Client subdirectory mounts<br />
|4<br />
|VFS, MDS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=15267 15276]<br />
|<small>Mount a subdirectory of a filesystem from the client instead of the root.</small><br />
|-<br />
|Implement a distributed snapshot mechanism<br />
|5<br />
|MDS, OST, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=14124 14124]<br />
|<small>Implement distributed snapshot mechanism; initially with only loosely synchronized operations (possibly ordered between MDS and OSS), or blocking whole filesystem while consistent snapshot is created. After the snapshot has been created, modify the fsname of the MDT and OSTs so that it can be mounted separately.</small><br />
|-<br />
|Improve QOS Round-Robin object allocator<br />
|5<br />
|MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Improve LOV QOS allocator to always do weighted round-robin allocation, instead of degrading into weighted random allocations once the OST free space becomes imbalanced. This evens out allocations continuously, avoids crazy/bad OST allocation imbalances when QOS becomes active, and allows adding weighting for things like current load, OST RAID rebuild, etc.</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI or inode lookups in most cases.</small><br />
|-<br />
|Readdir Object Statahead<br />
|5<br />
|VFS, DLM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18526 18526]<br />
|<small>Enhancement of current statahead to do object glimpse asynchronously once inode statahead has returned layout information. Preferred solution is readdir+ or SOM, but this could help in the short term, and would still be useful for open files and does not affect the network protocol so could be removed when those features are available.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|-<br />
|Simplified Interoperability<br />
|6<br />
|RPC, VFS<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18496 18496]<br />
|<small>Clean up client state before server upgrade to minimize or eliminate the need to have message format interoperability. The client only needs to track open files, and all other state (locks, cached pages, etc) can be dropped and re-fetched as needed from the server. Change client recovery to re-open files from open file handles instead of from saved RPCs.</small><br />
|-<br />
|Enhanced OST Pools Support<br />
|6<br />
|MDS, LOV<br />
|<br />
|<small>Improve OST pools support to allow mandatory OST enforcement (i.e. only allow specific users to access certain pools, including the default "all OSTs" pool), more complex policy specification (e.g. select a fallback pool on ENOSPC). Allow default initial file placement policies (e.g., server pool, stripe width) to be defined based on cluster membership (NID, UID, GID).</small><br />
|-<br />
|Replay Signatures<br />
|6<br />
|RPC, recovery<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18547 18547]<br />
|<small>Allow MDS/OSS to determine if client can legitimately replay an RPC, by digitally signing it at processing time and verifying the signature at replay time.</small><br />
|-<br />
|Network Request Scheduler (NRS)<br />
|6<br />
|RPC, OST, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13634 13634]<br />
|<small>Order IO (and possibly metadata) requests by client, file offset, priority, etc in order to improve overall back-end efficiency and/or provide QOS to clients. Dynamically change the number of RPCs in flight for each client to balance the RPC traffic at the server. Previous research done by Sun shows this can significantly improve overall performance.</small><br />
|-<br />
|Lustre Block Device<br />
|6<br />
|VFS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Lustre object lloop driver exports block device to userspace, bypassing filesystem. Code partly works and is part of 1.6.4+, but has correctness issues and potential performance problems. It needs to be ported to newer kernels.</small><br />
|-<br />
|Swap on Lustre<br />
|7<br />
|VFS, VM<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=5498 5498]<br />
|<small>Depends on the Lustre block device. Has problems when working under memory pressure, which makes it mostly useless until those problems are fixed.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|7<br />
|HSM, MDS, LOV<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite). The HSM project implements layout lock support and policy engine for automatic space management. This requires an ioctl that transparently changes an MDS inode to point to the migrated object(s) instead of the original object(s) and then schedules the old object(s) for destruction.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11859Lustre Project List2010-09-06T01:02:46Z<p>Adilger: /* List of Lustre Features and Projects */ reference lustre-devel</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
After you have chosen a project, or if you are having trouble deciding what to work on, please contact the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to discuss your project with the Lustre developers. That will ensure that the work you are doing is in line with other plans/projects for Lustre and also to ensure that nobody else is working on the same thing.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. More advanced work includes improved test scheduling, dynamic cluster configuration, and the use of virtual machines for functional tests.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|Readdir with large requests<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. It is expected to be significant over WAN high-latency links.</small><br />
|-<br />
|32TB ldiskfs filesystems<br />
|5<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI lookups in most cases.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|6<br />
|HSM, layout<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite); HSM project implements layout lock support and policy engine for automatic space management.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11858Lustre Project List2010-09-05T22:42:29Z<p>Adilger: /* List of Lustre Features and Projects */ add new</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|ioctl() number cleanups<br />
|1<br />
|kernel<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20731 20731]<br />
|<small>Clean up Linux IOC numbering to properly use "size" field so that mixed 32- and 64-bit kernel/userspace ioctls work correctly. Attention needs to be paid to maintaining userspace compatibility for a number of releases, so the old ioctl() numbers cannot simply be removed.</small><br />
|-<br />
|Improve testing Efficiency<br />
|3<br />
|shell, test<br />
|<br />
|<small>Improve the performance, efficiency, and coverage of the acceptance-small.sh test scripts. More advanced work includes improved test scheduling, dynamic cluster configuration, and virtual machines for functional tests.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|Readdir with large requests<br />
|4<br />
|MDS, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17833 17833]<br />
|<small>Read directory pages in large chunks instead of the current page-at-a-time reads from the client. This will improve readdir performance somewhat, and reduce load on the MDS. It is expected to be significant over WAN high-latency links.</small><br />
|-<br />
|32TB ldiskfs filesystems<br />
|5<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).<br />
</small><br />
|-<br />
|All RPCs pass a lock handle<br />
|5<br />
|DLM, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=22849 22849]<br />
|<small>For protocol correctness, and improved performance, it would be desirable for all RPCs that are done with a client lock held to send the lock handle along with the request. For OST requests this means all read, write, truncate operations (unless "lockless") should include a lock handle. This allows the OST to validate the request is being done by a client that holds the correct locks, and allows lockh->lock->object lookups to avoid OI lookups in most cases.</small><br />
|-<br />
|OST Space Management (Basic)<br />
|6<br />
|HSM, layout<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=13107 13107]<br />
|<small>Simple migration capability - transparently migrate objects/files between OSTs (blocking application writes, or aborting migration during contention); evacuate OSTs and move file data to other OSTs; add new OST and balance data on it. The OST doesn't really need to understand this, only the MDS (for LOV EA rewrite) and client (LOV EA rewrite); HSM project implements layout lock support and policy engine for automatic space management.</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11857Lustre Project List2010-09-05T22:25:07Z<p>Adilger: /* List of Lustre Features and Projects */ add new</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
{| border=2 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|Improve testing Efficiency<br />
|1<br />
|shell, test<br />
|<br />
|<small>Improve the performance, efficiency, and coverage of acc-sm. This includes improved test scheduling, dynamic cluster configuration, and virtual machines for functional tests.</small><br />
|-<br />
|Config save/edit/restore<br />
|3<br />
| MGS, llog, config<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=17094 17094]<br />
|<small>Need to be able to backup/edit/restore the client/MDS/OSS config llog files after a writeconf. One reason is for config recovery if the config llog becomes corrupted. Another reason is that all of the filesystem tunable parameters (including all of the OST pool definitions) are stored in the config llog and are lost if a writeconf is done.</small><br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|32TB ldiskfs filesystems<br />
|5<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).<br />
</small><br />
|-<br />
|Imperative recovery<br />
|6<br />
|recovery, RPC<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=18767 18767]<br />
|<small>Reduce recovery time by having the server notify clients after recovery has completed instead of waiting for the client to time out the RPC before it begins recovery.</small><br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Finding_a_Project&diff=11856Finding a Project2010-09-05T14:13:13Z<p>Adilger: /* Selecting a Project to Enhance Lustre */</p>
<hr />
<div><small>''(Updated: Nov 2009)''</small><br />
__TOC__<br />
This page describes how to [[#Finding a Bug to Fix|find a bug to fix]], [[#Selecting a Project to Enhance Lustre|select a project to enhance Lustre™]], [[#Helping with Lustre Testing|help with Lustre testing]], or [[#Contributing to Lustre User Documentation|contribute to the Lustre user documentation]]. Lustre defects and features or to-do items are logged in the Bugzilla bug tracking system. <br />
<br />
You can also contact the [[Lustre_Mailing_Lists|Lustre Development mailing list]] (often referred to as [mailto:lustre-devel@lists.lustre.org lustre-devel]) to discuss ideas for projects that match your skills and interests. Note that Lustre encompasses a number of development areas, including user tools, documentation, disk filesystems, networking, kernel integration, etc., so you can almost always find a project that is interesting and challenging. (For a user documentation project, submit your idea to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].)<br />
<br />
<br />
Having a specific problem to fix requires an understanding of the flow of operations and how that maps to specific code. It gives you a concrete goal that provides a context for investigating the code, rather than just reading vaguely through the vast Lustre code base. <br />
<br />
Once you have selected a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel] to discuss the best approach to take and to keep others aware of what you are working on. <br />
<br />
== Finding a Bug to Fix ==<br />
Fixing bugs in Lustre is a good way to become familiar with the Lustre code if you've not worked on it before. Some ways to find a bug you'd like to work on are:<br />
<br />
* ''Search [https://bugzilla.lustre.org/buglist.cgi?query_format=advanced&product=Lustre&keywords_type=anywords&keywords=easy+needs-test&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&order=Reuse+same+sort+as+last+time Bugzilla] for bugs tagged with the "easy" keyword.'' Some Lustre developers use this keyword to indicate that a bug could be fixed by someone without in-depth familiarity with the Lustre code.<br />
* ''Search [https://bugzilla.lustre.org/query.cgi Bugzilla] for very old Lustre bugs.'' These are typically non-critical bugs that are not dependent on a release timeline. They can vary widely in complexity. In particular, doing an empty Bugzilla query and looking at the first 100 items (sorted by bug number) shows a lot of bugs that are either relatively hard to reproduce, not generally visible to users, or "nice-to-have" features that no customer has specifically prioritized to be fixed.<br />
<br />
== Selecting a Project to Enhance Lustre ==<br />
If you'd like to take on a project to enhance or add a new feature to Lustre, consider one of these options:<br />
<br />
* Pick a project from the [[Lustre Project List]]. For guidance in selecting or proceeding with a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel].<br />
<br />
* Ask for a project on [mailto:lustre-devel@lists.lustre.org lustre-devel]. This mailing list is read by many of the Lustre developers and is a good place for questions, ideas, and feedback.<br />
<br />
* ''Assist with keeping Lustre up-to-date with recent kernel changes.'' Porting Lustre to newer kernel versions is an ongoing effort, given the large number of vendor and upstream kernel releases. For some changes, a simple fix to the Lustre code will be required, while for others, a good understanding of the Linux kernel and how Lustre interfaces with it is needed.<br />
* ''Propose a new feature that can be developed as a separate module on top of Lustre.'' Be sure to get feedback on your proposal by contacting [mailto:lustre-devel@lists.lustre.org lustre-devel] before you get started.<br />
<br />
== Helping with Lustre Testing ==<br />
Testing Lustre under a variety of workloads is always of interest. The more unusual the IO pattern used by a benchmark, application, or testing tool, the more likely it is to find something of interest.<br />
<br />
To find out how you can contribute to the testing of upcoming Lustre releases, see [[Lustre_Test_Plans|Lustre Test Plans]].<br />
<br />
== Contributing to Lustre User Documentation ==<br />
You are invited to contribute to the [http://wiki.lustre.org Lustre wiki] or the [[Lustre_Documentation|''Lustre Operations Manual'']]. <br />
<br />
Contribute to the Lustre wiki by:<br />
* Sending your enhancement request or description of a defect to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
* Creating content for a new topic for the Lustre wiki and submitting it to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
<br />
Contribute to the ''Lustre Operations Manual'' by:<br />
* Sending your enhancement request or description of a defect to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].<br />
* Writing content to fulfill an enhancement request. You can find a project by searching [https://bugzilla.lustre.org/query.cgi Bugzilla] using the search criteria "Documentation" and "Manual Topics". Submit your content to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=ZFS_and_Lustre&diff=11855ZFS and Lustre2010-09-05T14:12:37Z<p>Adilger: </p>
<hr />
<div><small>''(Updated: Dec 2009)''</small><br />
__TOC__<br />
The Lustre™ node file system ''ldiskfs'' (based on ext3/ext4) is limited to an 8 TB maximum file system size and offers no guarantee of data integrity. To improve the reliability and resilience of the underlying file system on the OSS and MDS components, Lustre will add ZFS support.<br />
<br />
Lustre supporting ZFS will offer a number of advantages, such as improved data integrity with transaction-based, copy-on-write operations and an end-to-end checksum on every block.<br />
<br />
Copy-on-write means that ZFS never overwrites existing data. Changed information is written to a new block and the block pointer to in-use data is only moved after the write transaction is completed. This mechanism is used all the way up to the file system block structure at the top block.<br />
<br />
To avoid data corruption, ZFS performs end-to-end checksumming. The checksum is not stored with the data block, but rather in the pointer to the block. All checksums are done in server memory, so errors not caught by other file systems are detected in ZFS, such as:<br />
* Phantom writes, where the write is dropped on the floor.<br />
* Misdirected reads or writes, where the disk accesses the wrong block.<br />
* DMA parity errors between the array and server memory or from the driver, since the checksum validates data inside the array.<br />
* Driver errors, where data winds up in the wrong buffer inside the kernel.<br />
* Accidental overwrites, such as swapping to a live file system.<br />
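The reason a checksum kept in the block pointer catches these errors, while one kept alongside the data would not, can be sketched in a few lines. This is only an illustration: the <code>Disk</code>, <code>BlockPointer</code>, and helper names are hypothetical, and SHA-256 is used simply because it is available in the Python standard library, not because it is what ZFS uses by default.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Illustrative choice of hash; the real algorithm is a pool setting.
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """Parent-side pointer: records where the child block lives and
    what its contents are expected to hash to."""

    def __init__(self, address: int, expected: str):
        self.address = address
        self.expected = expected


class Disk:
    """Toy disk: a mapping from block address to block contents."""

    def __init__(self):
        self.blocks = {}

    def write(self, address: int, data: bytes) -> None:
        self.blocks[address] = data

    def read(self, address: int) -> bytes:
        return self.blocks.get(address, b"")


def write_block(disk: Disk, address: int, data: bytes) -> BlockPointer:
    # The checksum goes into the pointer, not next to the data.
    disk.write(address, data)
    return BlockPointer(address, checksum(data))


def block_is_valid(disk: Disk, ptr: BlockPointer) -> bool:
    # Whatever the disk returns must match what the parent expects.
    return checksum(disk.read(ptr.address)) == ptr.expected


disk = Disk()
ptr = write_block(disk, 7, b"new contents")
ok_before = block_is_valid(disk, ptr)

# Simulate a phantom write: the disk silently kept the old contents.
disk.blocks[7] = b"old contents"
ok_after = block_is_valid(disk, ptr)
```

In this sketch the stale block at address 7 is internally consistent, so a checksum stored inside the block itself would still verify. Only the copy held by the parent pointer exposes the dropped write, and the same comparison fails for a misdirected read that returns some other valid block.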
<br />
In Lustre, ZFS checksumming will be done by the Lustre client on the application node. This will detect any data corruption introduced into the network between the application node and the disk drive in the Lustre storage system.<br />
<br />
Previous testing of Lustre with network checksums has resulted in the detection of previously unknown corruption in network cards. These cards silently introduced data corruption that went undetected without the use of checksums. It should be noted that the checksum computation does consume some processor cycles, approximately 1 GHz of CPU time to process 500 MB/sec of I/O.<br />
<br />
''An implementation note:'' Previously, ZFS support was being developed and tested with a user space implementation of the ZFS DMU. Currently, we are running the DMU in kernel space. Also, the Lustre DMU code is almost entirely common with the Solaris version of ZFS, so Lustre support for ZFS will closely parallel the Solaris release of ZFS.<br />
<br />
Lustre support of ZFS will offer several specific advantages:<br />
* ''Self-healing capability'' - In a mirrored or RAID configuration, ZFS not only detects data corruption, but it automatically corrects the bad data.<br />
* ''Improved administration'' - Because ZFS detects and reports data corruption on all read and write errors at the block level, it is easier for system administrators to quickly identify which hardware components are corrupting data. ZFS also has very easy-to-use command-line administration utilities.<br />
* ''SSD support'' - ZFS supports the addition of high-speed I/O devices, such as SSDs, to the storage pool. The Read Cache Pool or L2ARC acts as a cache layer between memory and the disk. This support can substantially improve the performance of random read operations. SSDs can also be used to improve synchronous write performance, by adding them to the pool as log devices. You can add as many SSDs to your storage pool as you need to increase your read cache size and IOPS, your synchronous write IOPS, or both.<br />
* ''Scalability'' - ZFS is a 128-bit file system. This means that current restrictions on maximum-size file systems for a single MDT or OST, maximum stripe size, and maximum size of a single file will be removed. ZFS support will also remove the current 16 TB limitation on LUNs.<br />
<br />
For more general information about ZFS, see [[ZFS Resources]].</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Features&diff=11854Lustre Features2010-09-05T14:10:10Z<p>Adilger: moved Lustre Features to Lustre Project List:&#32;Don't want to confuse this with existing features</p>
<hr />
<div>#REDIRECT [[Lustre Project List]]</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11853Lustre Project List2010-09-05T14:10:10Z<p>Adilger: moved Lustre Features to Lustre Project List:&#32;Don't want to confuse this with existing features</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
{| border=1 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|32TB ldiskfs filesystems<br />
|5<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).<br />
</small><br />
|-<br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Clustered_Metadata&diff=11852Clustered Metadata2010-09-05T13:38:37Z<p>Adilger: /* Recovery */ mention epochs at the end instead of the beginning</p>
<hr />
<div><small>''(Updated: Jan 2010)''</small><br />
__TOC__<br />
This document describes the design of the clustered metadata handling<br />
for Lustre™. This material depends on other Lustre designs, such as: <br />
<br />
* General recovery<br />
* Orphan Recovery<br />
* Metadata Write Back caching<br />
<br />
For a draft of the design document, see [[Media:HPCS_CMD_06_15_09.pdf|''Clustered Metadata Design'']].<br />
<br />
== Introduction ==<br />
<br />
Overall, the clustered metadata handling is structured as follows: <br />
<br />
* A cluster of metadata servers manages a collection of inode groups. Each inode group is a Lustre device exporting the usual metadata API augmented with a few operations specifically crafted for metadata clustering. We refer to the collection of inodes behind each such device as an inode group.<br />
* Directory formats for file systems used on the MDS devices are changed to allow directory entries to contain an inode group and identifier of the inode.<br />
* A logical clustered metadata driver is introduced below the client Lustre file system write back cache driver that maintains connections with the MDS servers.<br />
* A single metadata protocol is used by the client file system to make updates on the MDSs and by the MDSs to make updates involving other MDSs.<br />
* A single recovery protocol is used for both client-MDS and MDS-MDS service.<br />
* Directories can be split across multiple MDS nodes. In this case, a primary MDS directory inode contains an extended attribute that points at other MDS inodes, which we call directory objects.<br />
<br />
== Configuration management and startup ==<br />
<br />
The configuration will name an MDS server, and optionally a failover<br />
node, which hold the root inode for a fileset. Clients will contact<br />
that MDS for the root inode during mount, as they do already. <br />
<br />
They will also fetch from it a clustering descriptor. The clustering
descriptor contains a header and an array that lists which inode groups are
served by which server.<br />
<br />
Through normal mechanisms, clients will wait and probe for available<br />
metadata servers, during startup and cluster transitions. When new<br />
servers are found or configurations have changed, they can update their<br />
clustering descriptor as they update the LOV striping descriptor for<br />
OSTs.<br />
<br />
== Data Structures ==<br />
<br />
The ''fid'' contains a new 32 bit integer to name the inode group. <br />
<br />
Directory inodes on the MDS, when large, contain a new EA which is a<br />
descriptor of how the directory is split over directory objects,<br />
residing on other MDSs. This EA is subject to ordinary concurrency<br />
control by the MDS holding the inode. The EA is virtually identical<br />
to the LOV EA.<br />
<br />
== The clustered metadata client (CMC) ==<br />
<br />
The function of the CMC is to figure out from the command issued which MDC to use. This is based on: <br />
* The inode groups in the request<br />
* A hash value of names used in the request, combined with the EA of a primary inode involved in the request<br />
* For ''readdir'', the directory offset combined with the EA of the primary inode<br />
* The clustering descriptor<br />
<br />
In any case, every command is dispatched to a single metadata server and<br />
the clients will not engage more than one metadata server for a single<br />
request. <br />
<br />
The API changes here are minimal and the client part of the implementation is trivial.<br />
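The dispatch decision described above can be sketched as follows. Everything concrete here is an assumption made for illustration: the clustering descriptor is modelled as a plain mapping from inode group to server name, the directory EA as a list of inode groups holding the directory objects, and CRC32 stands in for whatever name hash the real implementation uses.

```python
import zlib

# Hypothetical clustering descriptor: inode group -> serving MDS.
CLUSTERING_DESCRIPTOR = {0: "mds0", 1: "mds1", 2: "mds2"}


def select_inode_group(name, primary_group, directory_ea=None):
    """Pick the inode group that should handle an operation on `name`.

    primary_group -- inode group of the primary directory inode
    directory_ea  -- list of inode groups holding the directory objects
                     of a split directory, or None if it is not split
    """
    if not directory_ea:
        # Unsplit directory: the primary inode's group handles everything.
        return primary_group
    # Split directory: hash the name to choose one directory object.
    index = zlib.crc32(name.encode("utf-8")) % len(directory_ea)
    return directory_ea[index]


def select_server(name, primary_group, directory_ea=None,
                  descriptor=CLUSTERING_DESCRIPTOR):
    """Map a request to exactly one metadata server."""
    group = select_inode_group(name, primary_group, directory_ea)
    return descriptor[group]


# An unsplit directory always goes to the server of its primary inode.
unsplit_target = select_server("results.dat", primary_group=1)

# A directory split over groups 0 and 2 spreads names across two servers,
# but any given name always maps to the same one.
split_target = select_server("results.dat", primary_group=1,
                             directory_ea=[0, 2])
```

Because the hash depends only on the name and the EA, every client holding the same EA and clustering descriptor picks the same server for a given entry, which is what lets each request go to a single MDS.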
<br />
== MDS implementation ==<br />
<br />
For the most part, operations are similar or identical to what they were before. In some cases, multiple MDS servers are involved in updates.<br />
<br />
''getattr'', ''open'', ''readdir'', ''setattr'' and ''lookup'' methods are unaffected. <br />
<br />
Methods adding entries to directories are modified in some cases: <br />
<br />
* ''mkdir'' always creates the new directory on another MDS.<br />
* ''unlink'', ''rmdir'', and ''rename'' may involve more than one MDS.<br />
* For ''large directories'', all operations making updates to directories can cause a [[#Directory_Split|directory split]].<br />
* For ''other operations'', if no splits in large directories are encountered, the operation proceeds as it would when executed on a single MDS.<br />
<br />
=== Directory Split ===<br />
<br />
A directory can be striped over several MDTs, just as a file can be striped over several OSTs. The directory is then split into several objects, each located on a different MDT. The layout information (stripe EA) is stored in the extended attributes of all split objects.<br />
<br />
== Recovery ==<br />
<br />
Initially, metadata operations that span multiple MDSs (MDTs) will be ordered and synchronous to simplify recovery from a system crash. This may impact the performance of operations involving several MDTs. Also, an inode leak may occur after MDS recovery, but only in such a way that data is never lost. These leaked inodes will be deleted by lfsck verification of the MDT filesystems.<br />
<br />
In the long term, CMD recovery will rely on global epochs, which will allow distributed asynchronous updates to multiple MDTs.<br />
<br />
== Locking ==<br />
<br />
We believe locking can be done in ''fid'' order as it is currently done on the MDS.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Finding_a_Project&diff=11851Finding a Project2010-09-05T13:22:39Z<p>Adilger: /* Selecting a Project to Enhance Lustre */ add Lustre Features wiki page</p>
<hr />
<div><small>''(Updated: Nov 2009)''</small><br />
__TOC__<br />
This page describes how to [[#Finding a Bug to Fix|find a bug to fix]], [[#Selecting a Project to Enhance Lustre|select a project to enhance Lustre™]], [[#Helping with Lustre Testing|help with Lustre testing]], or [[#Contributing to Lustre User Documentation|contribute to the Lustre user documentation]]. Lustre defects and features or to-do items are logged in the Bugzilla bug tracking system. <br />
<br />
You can also contact the [[Lustre_Mailing_Lists|Lustre Development mailing list]] (often referred to as [mailto:lustre-devel@lists.lustre.org lustre-devel]) to discuss ideas for projects that match your skills and interests. Note that Lustre encompasses a number of development areas, including user tools, documentation, disk filesystems, networking, kernel integration, etc., so you can almost always find a project that is interesting and challenging. (For a user documentation project, submit your idea to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].)<br />
<br />
<br />
Having a specific problem to fix requires an understanding of the flow of operations and how that maps to specific code. It gives you a concrete goal that provides a context for investigating the code, rather than just reading vaguely through the vast Lustre code base. <br />
<br />
Once you have selected a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel] to discuss the best approach to take and to keep others aware of what you are working on. <br />
<br />
== Finding a Bug to Fix ==<br />
Fixing bugs in Lustre is a good way to become familiar with the Lustre code if you've not worked on it before. Some ways to find a bug you'd like to work on are:<br />
<br />
* ''Search [https://bugzilla.lustre.org/buglist.cgi?query_format=advanced&product=Lustre&keywords_type=anywords&keywords=easy+needs-test&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&order=Reuse+same+sort+as+last+time Bugzilla] for bugs tagged with the "easy" keyword.'' Some Lustre developers use this keyword to indicate that a bug could be fixed by someone without in-depth familiarity with the Lustre code.<br />
* ''Search [https://bugzilla.lustre.org/query.cgi Bugzilla] for very old Lustre bugs.'' These are typically non-critical bugs that are not dependent on a release timeline. They can vary widely in complexity. In particular, doing an empty Bugzilla query and looking at the first 100 items (sorted by bug number) shows a lot of bugs that are either relatively hard to reproduce, not generally visible to users, or "nice-to-have" features that no customer has specifically prioritized to be fixed.<br />
<br />
== Selecting a Project to Enhance Lustre ==<br />
If you'd like to take on a project to enhance or add a new feature to Lustre, consider one of these options:<br />
<br />
* Pick a project from the [[Lustre Features]] list. For guidance in selecting or proceeding with a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel].<br />
<br />
* Ask for a project on [mailto:lustre-devel@lists.lustre.org lustre-devel]. This mailing list is read by many of the Lustre developers and is a good place for questions, ideas, and feedback.<br />
<br />
* ''Assist with keeping Lustre up-to-date with recent kernel changes.'' Porting Lustre to newer kernel versions is an ongoing effort, given the large number of vendor and upstream kernel releases. For some changes, a simple fix to the Lustre code will be required, while for others, a good understanding of the Linux kernel and how Lustre interfaces with it is needed.<br />
* ''Propose a new feature that can be developed as a separate module on top of Lustre.'' Be sure to get feedback on your proposal by contacting [mailto:lustre-devel@lists.lustre.org lustre-devel] before you get started.<br />
<br />
== Helping with Lustre Testing ==<br />
Testing Lustre under a variety of workloads is always of interest. The more unusual the IO pattern used by a benchmark, application, or testing tool, the more likely it is to find something of interest.<br />
<br />
To find out how you can contribute to the testing of upcoming Lustre releases, see [[Lustre_Test_Plans|Lustre Test Plans]].<br />
<br />
== Contributing to Lustre User Documentation ==<br />
You are invited to contribute to the [http://wiki.lustre.org Lustre wiki] or the [[Lustre_Documentation|''Lustre Operations Manual'']]. <br />
<br />
Contribute to the Lustre wiki by:<br />
* Sending your enhancement request or description of a defect to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
* Creating content for a new topic for the Lustre wiki and submitting it to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
<br />
Contribute to the ''Lustre Operations Manual'' by:<br />
* Sending your enhancement request or description of a defect to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].<br />
* Writing content to fulfill an enhancement request. You can find a project by searching [https://bugzilla.lustre.org/query.cgi Bugzilla] using the search criteria "Documentation" and "Manual Topics". Submit your content to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Project_List&diff=11850Lustre Project List2010-09-05T13:21:35Z<p>Adilger: Initial entry</p>
<hr />
<div>== List of Lustre Features and Projects ==<br />
<br />
Below is a list of Lustre features and projects that are just waiting for someone to start working on them. They are listed roughly in order of increasing complexity, but this is highly dependent upon the coding skills of the developer and their familiarity with the Lustre code base.<br />
<br />
{| border=1 cellpadding=0<br />
|-<br />
!Feature<br />
!Complexity<br />
!Required skills<br />
!Tracking Bug<br />
!Brief Description<br />
|-<br />
|mdd-survey tools for performance analysis<br />
|3<br />
|obdfilter-survey, mdd, benchmarking<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=21658 21658]<br />
|<small>Add a low-level metadata unit test to allow measuring performance of the metadata stack without having connected clients, similar to and/or integrated with the obdfilter survey (echo client, echo server).</small><br />
|-<br />
|32TB ldiskfs filesystems<br />
|5<br />
|ldiskfs, obdfilter<br />
|[https://bugzilla.lustre.org/show_bug.cgi?id=20063 20063]<br />
|<small>Single OST sizes larger than 16TB. This is largely supported in newer ext4 filesystems (e.g. RHEL5.4, RHEL6), but thorough testing and some bug fixing work may be needed in obdfilter (1.8, 2.0) or OFD (2.x), and other work may be needed in client (all versions).<br />
</small><br />
|-<br />
|}</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Finding_a_Project&diff=11844Finding a Project2010-08-30T19:57:17Z<p>Adilger: /* Selecting a Project to Enhance Lustre */ combine kernel porting tasks, remove "small-project" keyword (which doesn't exist)</p>
<hr />
<div><small>''(Updated: Nov 2009)''</small><br />
__TOC__<br />
This page describes how to [[#Finding a Bug to Fix|find a bug to fix]], [[#Selecting a Project to Enhance Lustre|select a project to enhance Lustre™]], [[#Helping with Lustre Testing|help with Lustre testing]], or [[#Contributing to Lustre User Documentation|contribute to the Lustre user documentation]]. Lustre defects and features or to-do items are logged in the Bugzilla bug tracking system. <br />
<br />
You can also contact the [[Lustre_Mailing_Lists|Lustre Development mailing list]] (often referred to as [mailto:lustre-devel@lists.lustre.org lustre-devel]) to discuss ideas for projects that match your skills and interests. Note that Lustre encompasses a number of development areas, including user tools, documentation, disk filesystems, networking, kernel integration, etc., so you can almost always find a project that is interesting and challenging. (For a user documentation project, submit your idea to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].)<br />
<br />
<br />
Having a specific problem to fix requires an understanding of the flow of operations and how that maps to specific code. It gives you a concrete goal that provides a context for investigating the code, rather than just reading vaguely through the vast Lustre code base. <br />
<br />
Once you have selected a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel] to discuss the best approach to take and to keep others aware of what you are working on. <br />
<br />
== Finding a Bug to Fix ==<br />
Fixing bugs in Lustre is a good way to become familiar with the Lustre code if you've not worked on it before. Some ways to find a bug you'd like to work on are:<br />
<br />
* ''Search [https://bugzilla.lustre.org/buglist.cgi?query_format=advanced&product=Lustre&keywords_type=anywords&keywords=easy+needs-test&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&order=Reuse+same+sort+as+last+time Bugzilla] for bugs tagged with the "easy" keyword.'' Some Lustre developers use this keyword to indicate that a bug could be fixed by someone without in-depth familiarity with the Lustre code.<br />
* ''Search [https://bugzilla.lustre.org/query.cgi Bugzilla] for very old Lustre bugs.'' These are typically non-critical bugs that are not dependent on a release timeline. They can vary widely in complexity. In particular, doing an empty Bugzilla query and looking at the first 100 items (sorted by bug number) shows a lot of bugs that are either relatively hard to reproduce, not generally visible to users, or "nice-to-have" features that no customer has specifically prioritized to be fixed.<br />
<br />
== Selecting a Project to Enhance Lustre ==<br />
If you'd like to take on a project to enhance or add a new feature to Lustre, consider one of these options:<br />
<br />
* Ask for a project on [mailto:lustre-devel@lists.lustre.org lustre-devel]. This list is read by many of the Lustre developers and is a good place for questions, ideas, and feedback.<br />
<br />
* ''Assist with keeping Lustre up-to-date with recent kernel changes.'' Porting Lustre to newer kernel versions is an ongoing effort, given the large number of vendor and upstream kernel releases. For some changes, a simple fix to the Lustre code will be required, while for others, a good understanding of the Linux kernel and how Lustre interfaces with it is needed. For guidance in selecting or proceeding with a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel].<br />
<br />
* ''Propose a new feature that can be developed as a separate module on top of Lustre.'' Be sure to get feedback on your proposal by contacting [mailto:lustre-devel@lists.lustre.org lustre-devel] before you get started.<br />
<br />
== Helping with Lustre Testing ==<br />
Testing Lustre under a variety of workloads is always of interest. The more unusual the IO pattern used by a benchmark, application, or testing tool, the more likely it is to find something of interest.<br />
<br />
To find out how you can contribute to the testing of upcoming Lustre releases, see [[Lustre_Test_Plans|Lustre Test Plans]].<br />
<br />
== Contributing to Lustre User Documentation ==<br />
You are invited to contribute to the [http://wiki.lustre.org Lustre wiki] or the [[Lustre_Documentation|''Lustre Operations Manual'']]. <br />
<br />
Contribute to the Lustre wiki by:<br />
* Sending your enhancement request or description of a defect to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
* Creating content for a new topic for the Lustre wiki and submitting it to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
<br />
Contribute to the ''Lustre Operations Manual'' by:<br />
* Sending your enhancement request or description of a defect to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].<br />
* Writing content to fulfill an enhancement request. You can find a project by searching [https://bugzilla.lustre.org/query.cgi Bugzilla] using the search criteria "Documentation" and "Manual Topics". Submit your content to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Finding_a_Project&diff=11843Finding a Project2010-08-30T19:53:41Z<p>Adilger: /* Finding a Bug to Fix */ make "easy" bugzilla URL actually perform "easy" query in bugzilla</p>
<hr />
<div><small>''(Updated: Nov 2009)''</small><br />
__TOC__<br />
This page describes how to [[#Finding a Bug to Fix|find a bug to fix]], [[#Selecting a Project to Enhance Lustre|select a project to enhance Lustre™]], [[#Helping with Lustre Testing|help with Lustre testing]], or [[#Contributing to Lustre User Documentation|contribute to the Lustre user documentation]]. Lustre defects and features or to-do items are logged in the Bugzilla bug tracking system. <br />
<br />
You can also contact the [[Lustre_Mailing_Lists|Lustre Development mailing list]] (often referred to as [mailto:lustre-devel@lists.lustre.org lustre-devel]) to discuss ideas for projects that match your skills and interests. Note that Lustre encompasses a number of development areas, including user tools, documentation, disk filesystems, networking, kernel integration, etc., so you can almost always find a project that is interesting and challenging. (For a user documentation project, submit your idea to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].)<br />
<br />
<br />
Having a specific problem to fix requires an understanding of the flow of operations and how that maps to specific code. It gives you a concrete goal that provides a context for investigating the code, rather than just reading vaguely through the vast Lustre code base. <br />
<br />
Once you have selected a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel] to discuss the best approach to take and to keep others aware of what you are working on. <br />
<br />
== Finding a Bug to Fix ==<br />
Fixing bugs in Lustre is a good way to become familiar with the Lustre code if you've not worked on it before. Some ways to find a bug you'd like to work on are:<br />
<br />
* ''Search [https://bugzilla.lustre.org/buglist.cgi?query_format=advanced&product=Lustre&keywords_type=anywords&keywords=easy+needs-test&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&order=Reuse+same+sort+as+last+time Bugzilla] for bugs tagged with the "easy" keyword.'' Some Lustre developers use this keyword to indicate that a bug could be fixed by someone without in-depth familiarity with the Lustre code.<br />
* ''Search [https://bugzilla.lustre.org/query.cgi Bugzilla] for very old Lustre bugs.'' These are typically non-critical bugs that are not dependent on a release timeline. They can vary widely in complexity. In particular, doing an empty Bugzilla query and looking at the first 100 items (sorted by bug number) shows a lot of bugs that are either relatively hard to reproduce, not generally visible to users, or "nice-to-have" features that no customer has specifically prioritized to be fixed.<br />
<br />
== Selecting a Project to Enhance Lustre ==<br />
If you'd like to take on a project to enhance or add a new feature to Lustre, consider one of these options:<br />
<br />
* ''Search [https://bugzilla.lustre.org/query.cgi Bugzilla] for the keyword "small project".'' Some Lustre developers use this keyword to indicate that an enhancement request or bug is a stand-alone project suitable to be taken on by an external developer. When you have identified a project you'd like to work on, contact [mailto:lustre-devel@lists.lustre.org lustre-devel] to discuss the approach to be taken to address it.<br />
<br />
* ''Assist with keeping Lustre up-to-date with recent kernel changes.'' For some changes, a simple fix to the Lustre code will be required, while for others, a good understanding of the Linux kernel and how Lustre interfaces with it is needed. For guidance in selecting or proceeding with a project, contact [mailto:lustre-devel@lists.lustre.org lustre-devel].<br />
<br />
* ''Propose a new feature that can be developed as a separate module on top of Lustre.'' Be sure to get feedback on your proposal by contacting [mailto:lustre-devel@lists.lustre.org lustre-devel] before you get started.<br />
<br />
* ''Help port Lustre to a new kernel.'' Porting Lustre to newer kernel versions is an ongoing effort, given the large number of vendor and upstream kernel releases.<br />
<br />
== Helping with Lustre Testing ==<br />
Testing Lustre under a variety of workloads is always of interest. The more unusual the IO pattern used by a benchmark, application, or testing tool, the more likely it is to find something of interest.<br />
<br />
To find out how you can contribute to the testing of upcoming Lustre releases, see [[Lustre_Test_Plans|Lustre Test Plans]].<br />
<br />
== Contributing to Lustre User Documentation ==<br />
You are invited to contribute to the [http://wiki.lustre.org Lustre wiki] or the [[Lustre_Documentation|''Lustre Operations Manual'']]. <br />
<br />
Contribute to the Lustre wiki by:<br />
* Sending your enhancement request or description of a defect to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
* Creating content for a new topic for the Lustre wiki and submitting it to [mailto:lustre-wiki-feedback@sun.com lustre-wiki-feedback@sun.com].<br />
<br />
Contribute to the ''Lustre Operations Manual'' by:<br />
* Sending your enhancement request or description of a defect to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].<br />
* Writing content to fulfill an enhancement request. You can find a project by searching [https://bugzilla.lustre.org/query.cgi Bugzilla] using the search criteria "Documentation" and "Manual Topics". Submit your content to the [mailto:lustre-doc-bugs-team@sun.com Lustre documentation team].</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Submitting_Patches&diff=11831Submitting Patches2010-08-04T18:12:55Z<p>Adilger: /* Submitting Patches for Review */</p>
<hr />
<div><small>''(Updated: Dec 2009)''</small><br />
<br />
'''''NOTICE:''''' A transition from CVS to Git took place on Monday, December 14. For more information about the transition, see the [[Git Transition Notice]]. For details about how to migrate to Git, see [[Migrating to Git]].<br />
<br />
----<br />
<br />
When you are ready to have your patch reviewed, follow the process described below for submitting it using Bugzilla. <br />
<br />
'''''Note:''''' It is sometimes desirable to solicit reviews of a patch on the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to expose the patch to a wider audience. However, this will ''NOT'' put the patch on track to being accepted into the Lustre™ repository.<br />
<br />
=== Submitting Patches for Review ===<br />
<br />
To have your changes accepted into a mainline Lustre branch, your code must follow the Lustre [[Coding Guidelines]], and be reviewed and approved by senior Lustre engineers. Following these steps will speed up review of your changes and increase the likelihood of success:<br />
<br />
1. Read, complete, and return the form found at [[Media:Sun_Contributor_Agreement_1_5.pdf|Contributor Agreement]]. We cannot accept your contributions without this form. See [[Contribution_Policy|Contribution Policy]] for more information.<br />
<br />
2. Testing the patch is required before it can be submitted. The patch must include any new tests specific to the bug/feature. See [[Testing Lustre Code]] for specific details. <br />
<br />
3. Generate a patch with ''diff -upN'', ''git diff'', or ''git format-patch''. Please do not send other kinds of patches unless your reviewer requests them.<br />
<br />
The easiest command for generating a patch is:<br />
<pre><br />
[lustre]$ git diff {basebranch} > {patchname}.diff<br />
</pre><br />
where ''{basebranch}'' is the branch you are patching against (''b1_6'', ''b1_8'', or ''master''). Note that this patch will include both committed and uncommitted changes on your branch. If you have well-defined patches with proper commit comments as described below, it is also possible to use ''git format-patch'':<br />
<pre><br />
[lustre]$ git format-patch {since}<br />
e.g.<br />
[lustre]$ git format-patch -2 # format the last 2 commits<br />
</pre><br />
<br />
If you are unfamiliar with this process, use ''git diff''.<br />
<br />
If sending changes with ''git format-patch'' we ask that you follow our standard commit message format when making your commits, so that the patch can more easily be identified in the future. If you are doing a rebase, you will get a chance to modify/combine your commit messages. Commit messages for final patches should look like this:<br />
<pre><br />
b=<bugno> <One-line summary of change><br />
<br />
<Full description of change><br />
<br />
i=<inspector1><br />
i=<inspector2><br />
</pre><br />
<br />
If you are not using git format-patch, then simply adding the above lines at the start of the submission email is enough. If you are making a commit prior to submitting the patch for inspection, simply omit the ''i=<inspector>'' lines, and then use ''git commit --amend'' to change the commit comment after inspection is complete.<br />
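<br />
The commit message layout above can be checked mechanically before a patch is attached. The following is only a minimal sketch in Python, not a tool that ships with Lustre: the function name ''parse_commit_message'', the assumption that the bug number is purely numeric, and the inspector names in the example are all hypothetical.<br />

```python
import re


def parse_commit_message(message):
    """Split a 'b=<bugno> <summary>' style commit message into its parts.

    Hypothetical helper: assumes the bug number is numeric and that
    inspector lines start with 'i=' as in the template above.
    """
    lines = message.strip().splitlines()
    if not lines:
        raise ValueError("empty commit message")
    first = re.match(r"b=(\d+) (.+)$", lines[0])
    if first is None:
        raise ValueError("first line must look like 'b=<bugno> <summary>'")
    # Inspector lines are optional until inspection is complete.
    inspectors = [line[2:] for line in lines[1:] if line.startswith("i=")]
    body = [line for line in lines[1:] if not line.startswith("i=")]
    return {
        "bug": int(first.group(1)),
        "summary": first.group(2),
        "description": "\n".join(body).strip(),
        "inspectors": inspectors,
    }


# Example with made-up inspector names.
example = "b=20731 Clean up ioctl numbering\n\nUse the size field.\n\ni=alice\ni=bob\n"
parsed = parse_commit_message(example)
```

A message without ''i='' lines still parses and simply yields an empty inspector list, which matches the pre-inspection case described above.<br />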
<br />
4. Find or file a bug corresponding to your contribution in [http://bugzilla.lustre.org/ Bugzilla]. For more information about Bugzilla, see the [[Developers Guide to Bugzilla for Lustre|Developers Guide to Bugzilla]], the [https://bugzilla.lustre.org/page.cgi?id=bug-writing.html Bugzilla - Bug Writing Guidelines], or the [https://bugzilla.lustre.org/docs/html/using.html Bugzilla User Guide].<br />
<br />
* Provide the patch as an Attachment (click on "Add an Attachment")<br />
* Select the "patch" box.<br />
** If submitting a new bug with a patch attached, follow normal bug submission procedures. The support team will assign the bug and inspections as appropriate.<br />
** If working with an internal Lustre engineer, under "Flags" set the ''inspection'' flag to "?" and copy the email address of the engineer into the adjacent ''Requestee:'' field.<br />
** If you have completed testing of the patch, set the "acc-sm_passed_''release'' +" flag for the branch(es) that passed testing. If you have not actually run the acceptance-small.sh test script to completion (unless advised otherwise) you should describe the testing performed to date, and can optionally set the "more-testing_''release'' +" flag.<br />
** If you have not been collaborating with someone on the Lustre team and don't know who should review your work, assign the inspection to ''lustre-rmg-team@sun.com''<br />
* Click on "commit" to submit the attachment and inspection request.<br />
<br />
5. One or more reviewers will submit comments regarding your patch. Iterate the patch until you receive inspection approval, have passed all requested testing, or the bug is closed.<br />
<br />
6. Request patch landing permission by setting the "landing_''release'' ?" flag for your patch.<br />
<br />
7. Once you have landing approval (as given by the branch maintainer in the form of a "landing_''release'' +" flag on the patch), mail the patch to [mailto:lustre-gate-20@sun.com lustre-gate-20] for Lustre 2.0, or [mailto:lustre-gate-18@sun.com lustre-gate-18] for Lustre 1.8. Include the bug number and reviewer in the commit message along with a concise description of the change, as stated above.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_All-Hands_Meeting_12/08&diff=11817Lustre All-Hands Meeting 12/082010-07-27T21:46:53Z<p>Adilger: </p>
<hr />
<div>Once a year, the Lustre™ Engineering team gathers to discuss new features under development and testing efforts. This week-long event is known as the Lustre all-hands meeting. The Development presentations made at the December 2008 all-hands meeting are available here:<br />
<br />
* [[media:Simplified_InteropRecovery.pdf|Simplified Interoperability Recovery]] - Huang Hua<br />
* [[media:RecoveryTalk_2009.pdf|Recovery Overview]] - Robert Read<br />
* [[media:Quotas-TOI.pdf|Quotas-TOI]] - Yong Fan<br />
* [[media:QualityInitiativeTalk.pdf|Quality Initiative Talk]] - Robert Read<br />
* [[media:OST_Pools.pdf|OST Pools]] - Nathan Rutman<br />
* [[media:OST_Migration_RAID1_SNS.pdf|OST Migration RAID1 SNS]] - Andreas Dilger<br />
* [[media:NRS.pdf|Lustre NRS Simulation]] - Yingjin Qian, Wang Di<br />
* [[media:LustreInterop_1_8.pdf|Lustre Interoperability 1.8]] - Huang Hua<br />
* [[media:HDFS.pdf|HDFS]] - Wang Di<br />
* [[media:GIT_Overview.pdf|GIT Overview]] - Robert Read<br />
* [[media:COS-TOI.pdf|COS-TOI]] - Alexander Zarochentsev<br />
* [[media:CLIO-TOI.pdf|CLIO-TOI]] - Nikita Danilov<br />
* [[media:CLIO-TOI-notes.pdf|CLIO-TOI-notes]] - Nikita Danilov<br />
* [[media:CLIO.pdf|CLIO]] - Nikita Danilov</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=File:CLIO-TOI-notes.pdf&diff=11816File:CLIO-TOI-notes.pdf2010-07-27T21:43:34Z<p>Adilger: Nikita's notes for the CLIO-TOI session</p>
<hr />
<div>Nikita's notes for the CLIO-TOI session</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_All-Hands_Meeting_12/08&diff=11815Lustre All-Hands Meeting 12/082010-07-27T21:42:52Z<p>Adilger: add CLIO-TOI-notes</p>
<hr />
<div>Once a year, the Lustre™ Engineering team gathers to discuss new features under development and testing efforts. This week-long event is known as the Lustre all-hands meeting. The Development presentations made at the December 2008 all-hands meeting are available here:<br />
<br />
* [[media:Simplified_InteropRecovery.pdf|Simplified Interoperability Recovery]] - Huang Hua<br />
* [[media:RecoveryTalk_2009.pdf|Recovery Overview]] - Robert Read<br />
* [[media:Quotas-TOI.pdf|Quotas-TOI]] - Yong Fan<br />
* [[media:QualityInitiativeTalk.pdf|Quality Initiative Talk]] - Robert Read<br />
* [[media:OST_Pools.pdf|OST Pools]] - Nathan Rutman<br />
* [[media:OST_Migration_RAID1_SNS.pdf|OST Migration RAID1 SNS]] - Andreas Dilger<br />
* [[media:NRS.pdf|Lustre NRS Simulation]] - Yingjin Qian, Wang Di<br />
* [[media:LustreInterop_1_8.pdf|Lustre Interoperability 1.8]] - Huang Hua<br />
* [[media:HDFS.pdf|HDFS]] - Wang Di<br />
* [[media:GIT_Overview.pdf|GIT Overview]] - Robert Read<br />
* [[media:COS-TOI.pdf|COS-TOI]] - Alexander Zarochentsev<br />
* [[media:CLIO-TOI.pdf|CLIO-TOI]] - Nikita Danilov<br />
* [[media:CLIO-TOI-notes.pdf|CLIO-TOI-notes]] - Nikita Danilov<br />
* [[media:CLIO.pdf|CLIO]] - Nikita Danilov</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Architecture_-_Write_Back_Cache&diff=11814Architecture - Write Back Cache2010-07-27T21:39:02Z<p>Adilger: /* Definitions */ describe updates and operations</p>
<hr />
<div>'''''Note:''''' ''The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.''<br />
<br />
== Summary ==<br />
<br />
The meta-data write-back cache (WBC or MDWBC, where a possibility of<br />
misunderstanding exists) allows client meta-data operations to be<br />
delayed and batched. This increases client throughput and improves<br />
both network utilization and server efficiency.<br />
<br />
== Definitions ==<br />
<br />
; '''(MD)WBC''' : (Meta-data) Write-Back Cache<br />
<br />
; '''MD operation''' : A meta-data change performed by a client that changes the namespace in a consistent manner (e.g. create file, unlink, rename, etc.). An '''MD operation''' is normally composed from multiple '''MD updates''' that are performed on one or more MDT devices.<br />
<br />
; '''MD update''' : A low-level change to a single meta-data storage target. MD updates form the building blocks of an '''MD operation''' (e.g. insert directory entry, increment link count, change timestamp). Individual MD updates do not necessarily leave the filesystem in a consistent state, and need to be applied to the filesystem in an atomic manner in order to ensure consistency.<br />
<br />
; '''MD batch''' : A group of MD updates performed by a client such that: (a) the batch transforms the file system from one consistent state to another, (b) no other client depends on seeing the file system in any state where some, but not all of the MD operations in the batch are in effect. <br />
<br />
; '''reintegration''' : The process of applying an MD batch on a server. Reintegration executes all the MD operations in the batch and changes the file system from one consistent state to another.<br />
<br />
; '''dependency''' : A situation in which an MD operation modifies multiple separate pieces of client state that are otherwise not related. These dependent pieces of state have to be reintegrated ''atomically'' (in the data-base ACID sense). For example: <br />
<br />
* link and unlink introduce a dependency between the directory where the entry is added or removed, and the target object whose nlink count is updated.<br />
<br />
* cross-directory rename makes the parent directories dependent.<br />
<br />
* unlinking the last name of a file introduces a dependency between the file inode and its stripe objects that are to be destroyed.<br />
<br />
; '''coordinated reintegration''' : A special case of reintegration that occurs when the client cache contains dependent state pertaining to multiple servers. In this case the servers have to act in concert to guarantee consistency. Coordinated reintegration is originated by the client, which sends dependent batches to the servers in parallel. One (or more) of the servers assumes the role of coordinator, and uses persistent logs together with the CUT mechanism to either commit or roll back the distributed transaction.<br />
<br />
; '''object-of-conflict''' : An object in the extent of a lock owned by a client and also in the extent of a conflicting lock that another client is attempting to acquire, i.e., an object where the locks "intersect". A single pair of conflicting locks can have more than one object-of-conflict. This term is used in the QAS descriptions.<br />
<br />
== Requirements ==<br />
<br />
; '''scalability''' : the client should be able to execute 32K creations of 1--64KB files per second. Files may be created in different directories, with file counts per directory ranging from 1K to 100K.<br />
<br />
; '''correctness''' : reintegration changes the file system from one globally consistent state to another.<br />
<br />
; '''transactionality''' : reintegration assures that the disk image of the file system is consistent. This implies that reintegration is either done completely within a single transaction, or the batch contains enough information to ''cut'' reintegration into smaller pieces, each preserving consistency.<br />
<br />
; '''concurrency''' : when a client surrenders a meta-data lock it only flushes enough of its cache to guarantee correctness (i.e., flushing the whole meta-data cache is not necessary).<br />
<br />
== Details ==<br />
<br />
Instead of immediately sending MD operations to the server and waiting<br />
for their execution, the client caches them in some form, simulating<br />
their local effects (creating, modifying, and deleting VFS and VM<br />
entities such as inodes, dentries, pages, etc.). Later, a batched<br />
description of the cached operations is sent to the server and<br />
executed there.<br />
<br />
Two important aspects of the WBC are how MD batches are stored on the<br />
client and transported over the network. Possible extremes include<br />
pure (logical) logging where every operation is represented as a<br />
separate entity, and pure physical logging (aka "bulk state update")<br />
where only the latest state is maintained.<br />
<br />
The current design is to store cached MD updates as some sort of log in client memory and to transmit an MD batch as a bulk state update. Storing modifications as a log has the following advantages:<br />
<br />
* it is possible to create finer-grained batches, i.e., to reduce the amount of flushed state by flushing only a portion of the modified state for a given object;<br />
<br />
* resend and replay are simplified;<br />
<br />
* a higher degree of concurrency during reintegration seems possible: to reintegrate, the client "cuts" a certain prefix of the log and starts reintegrating it with the server. Meanwhile, operations on the objects involved in the reintegration can continue. This seems important, as reintegration of a large batch can take a (relatively) long time and stop-the-world cache flushing is undesirable.<br />
<br />
The disadvantage is an increased memory footprint (or, equivalently, more frequent reintegration).<br />
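<br />
The "cut a prefix of the log" idea can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the class ''UpdateLog'' and its methods are hypothetical names, updates are represented as plain tuples, and the sketch ignores the requirement that a real batch must leave the file system in a consistent state.<br />

```python
from collections import deque


class UpdateLog:
    """Hypothetical in-memory log of cached MD updates, oldest first."""

    def __init__(self):
        self.pending = deque()

    def append(self, update):
        # New local operations are always added at the tail.
        self.pending.append(update)

    def cut_prefix(self, count):
        """Detach the oldest `count` updates as a batch to reintegrate.

        Updates appended afterwards stay cached, so local operations
        can continue while the batch is in flight.
        """
        n = min(count, len(self.pending))
        return [self.pending.popleft() for _ in range(n)]


log = UpdateLog()
for name in ("a", "b", "c"):
    log.append(("create", name))

batch = log.cut_prefix(2)       # prefix handed to reintegration
log.append(("create", "d"))     # local work continues meanwhile
```

Because the batch is detached from the head of the log, nothing has to stop while it is being sent; the cost is that both the batch and the remaining log occupy client memory until the server confirms the batch.<br />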
<br />
The advantage of sending and applying updates as a batch is that it off-loads work from the server, bringing meta-data operations closer to data operations; e.g., ideally, a bulk update of directory pages can be very similar to a bulk update of regular file pages. The disadvantages are:<br />
<br />
* the need for a high level of trust in the clients, as they are permitted to carry out complex meta-data modifications whose consistency cannot be proven by the server, and<br />
<br />
* the need to apply the batch as a single transaction, as it cannot be split into smaller pieces.<br />
<br />
<br />
<br />
== Use Cases ==<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
!id !! quality attribute !! summary<br />
|-<br />
|sub-tree-operations || performance || A client creates a new sub-directory and populates it with a large number of files (and sub-directories, recursively)<br />
|-<br />
|sub-tree-conflict || usability || A client creates a new sub-directory and populates it with a large number of files (and sub-directories, recursively). Another client obtains a conflicting lock on this sub-directory<br />
|-<br />
|undo || performance || A client creates a new sub-directory, populates it with some number of files and then removes them all<br />
|-<br />
|data-consistency || usability || A client executes data and meta-data operations on existing files. Another client obtains a conflicting lock on some data.<br />
|-<br />
|unlink || usability, performance || A client removes a number of (not hard-linked) files and sends a batched update to the server. On successful execution of the batch, the client sends DESTROY RPCs to the OSTs.<br />
|-<br />
|recovery || usability || (A) A client performs a number of MD operations. (B) The client sends the batch to the server. (C) The server executes the batch. (D) The client gets a reply. (E) The server commits the batch. (F) The client gets the commit notification. The server crashes at any of A, B, C, D, E, or F.<br />
|-<br />
|dependency || usability || A client performs an MD operation involving more than one object (link, unlink, etc.). The lock protecting one of the objects involved is revoked.<br />
|-<br />
|rename || usability || A client renames a file across directories. The lock protecting one of these directories is revoked.<br />
|-<br />
|CMD-rename || usability, scalability || A client renames a file across directories located on different MD servers. A lock protecting one of these directories is revoked.<br />
|-<br />
|}<br />
<br />
== Quality Attribute Scenarios ==<br />
<br />
; '''sub-tree-operations'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || A client creates a new sub-directory and populates it with a large number of small files (and sub-directories, recursively)<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || achieve high client throughput on small file creations<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| performance<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| stream of meta-data and data operations<br />
|-align="left"<br />
|'''Environment:'''|| isolated directory subtree, not accessed by other clients<br />
|-align="left"<br />
|'''Artifact:'''|| lustre client<br />
|-align="left"<br />
|'''Response:'''|| The client executes operations locally, modifying VFS and VM objects in memory. Once enough MD operations are cached to form an efficient RPC, a batch is sent (possibly to multiple servers in parallel), while local operations continue without a slowdown. Ongoing MD operations require very little communication with the servers, because critical resources (object identifiers, disk space on the OSS, locks on meta-data objects) are leased to the client in advance (in the form of FID sequences, grants, and sub-tree locks, respectively).<br />
|-align="left"<br />
|'''Response measure:'''|| The client should be able to execute 32K creations of 1--64KB files per second. Files may be created in different directories, with file counts per directory ranging from 1K to 100K. The creation test has to run for 1, 5, and 10 minutes (1.9M, 9.6M, and 19.2M files in total, respectively).<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| How is a batch represented on the client? How is a batch transmitted over the network to the server? What forms of concurrency control are used?<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| see '''questions'''.<br />
|-<br />
|}<br />
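<br />
The file totals in the response measure follow directly from the target rate. A short Python check, assuming that "32K" means 32,000 creations per second (which is what reproduces the 1.9M, 9.6M, and 19.2M figures); the names used are illustrative only:<br />

```python
# Assumption: "32K" is read as 32,000 (not 32,768) creations per second.
RATE_PER_SECOND = 32_000


def files_created(minutes, rate=RATE_PER_SECOND):
    """Total files created when the target rate is sustained for `minutes`."""
    return rate * 60 * minutes


totals = {minutes: files_created(minutes) for minutes in (1, 5, 10)}
```

Under that assumption the totals are 1,920,000, 9,600,000, and 19,200,000 files for the 1, 5, and 10 minute runs.<br />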
<br />
; '''sub-tree-conflict'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || A client creates a new sub-directory and populates it with a large number of files (and sub-directories, recursively). Another client obtains a conflicting lock on this sub-directory<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || A consistent file system picture for both clients. Low latency for the lock acquisition.<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| performance, scalability, usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
| '''Stimulus:'''|| conflicting lock<br />
|-align="left"<br />
|'''Stimulus source:'''|| other client<br />
|-align="left"<br />
|'''Environment:'''|| shared directory <br />
|-align="left"<br />
|'''Artifact:'''|| client MD cache.<br />
|-align="left"<br />
|'''Response:'''|| Lock invalidation, including a flush of the dependent state. The cached updates for the object-of-conflict are taken, and "a minimal batch" is built as the transitive closure of all modifications that depend on modifications already in the batch. The caching policy guarantees that the minimal batch fits into a single RPC (per server). If the resulting batch is smaller than the maximal RPC size, additional state is flushed according to some policy (e.g., the oldest cached updates are flushed).<br />
|-align="left"<br />
|'''Response measure:'''|| Locking latency. Latency should be O(cached_state(object-of-conflict)), where cached_state(X) is the amount of cached MD state for the object X.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| how much state to flush? Do we need something similar to ASYNC_URGENT as in data-cache?<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| see '''questions'''.<br />
|-<br />
|}<br />
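<br />
Building the "minimal batch" described in the response above amounts to a graph closure over cached updates. The Python sketch below is an illustration, not the Lustre implementation: the function name ''minimal_batch'', the string identifiers for updates, and the ''related'' map (treated here as a symmetric "must be reintegrated together" relation) are all assumptions.<br />

```python
def minimal_batch(seed_updates, related):
    """Return the seed updates plus everything transitively tied to them.

    seed_updates: cached updates for the object-of-conflict.
    related: maps an update id to the ids of updates that must be
             reintegrated atomically with it (hypothetical structure).
    """
    batch = set()
    stack = list(seed_updates)
    while stack:
        update = stack.pop()
        if update in batch:
            continue
        batch.add(update)
        # Pull in every update that depends on one already in the batch.
        stack.extend(related.get(update, ()))
    return batch


# A hard link into dirA ties the directory entry to the nlink change,
# while an unrelated timestamp update on dirB stays in the cache.
related = {
    "insert_entry:dirA/x": ["nlink+1:x"],
    "nlink+1:x": ["insert_entry:dirA/x"],
    "set_mtime:dirB": [],
}
batch = minimal_batch(["insert_entry:dirA/x"], related)
```

Only the updates reachable from the object-of-conflict are flushed, which is what keeps the locking latency proportional to the cached state for that object rather than to the whole cache.<br />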
<br />
; '''undo'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || A client creates a new sub-directory, populates it with some number of files, and then removes them all<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || efficient handling of temporary files<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| performance, usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
| '''Stimulus:'''|| short-lived temporary files<br />
|-align="left"<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
|'''Environment:'''|| isolated directory<br />
|-align="left"<br />
|'''Artifact:'''|| client MD cache.<br />
|-align="left"<br />
|'''Response:'''|| cancellation of the cached state. To achieve this, before the batch is formed, the log of cached updates is preprocessed by replacing groups of operations with a smaller number of operations where possible (e.g., creation and removal of the same file cancel each other, leaving only an atime/mtime update on the parent directory, and successive atime/mtime updates on the parent cancel each other, leaving only the latest one).<br />
|-align="left"<br />
|'''Response measure:'''|| size of resulting batch (ideally should contain only time updates for the parent directory) should be O(1), i.e., independent of the number of files created and unlinked. <br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| what about the audit on the server? how to efficiently implement state cancellation?<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| see '''questions'''.<br />
|-<br />
|}<br />
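<br />
The log preprocessing described in the '''undo''' response can be sketched as two passes over the cached operations of one directory. This is a minimal illustration, not the Lustre implementation: ''struct wbc_op'', the ''WBC_OP_*'' kinds and ''wbc_log_compact()'' are hypothetical names, and every operation is assumed to carry the timestamp it would apply to the parent directory.<br />
<br />
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operation kinds in a cached meta-data update log. */
enum wbc_op_kind {
        WBC_OP_CREATE,
        WBC_OP_UNLINK,
        WBC_OP_SETTIME          /* atime/mtime update on the parent directory */
};

struct wbc_op {
        enum wbc_op_kind  wo_kind;
        const char       *wo_name;      /* file name, NULL for WBC_OP_SETTIME */
        long              wo_time;      /* timestamp applied to the parent */
        int               wo_dead;      /* set when the operation is cancelled */
};

/*
 * Compact the log of one directory in place and return the new length.
 * A create followed by an unlink of the same name cancels out, and all
 * time updates collapse into a single one carrying the latest timestamp.
 */
size_t wbc_log_compact(struct wbc_op *ops, size_t nr)
{
        size_t i;
        size_t j;
        size_t out = 0;
        long   latest = 0;
        int    need_settime = 0;

        /* Pass 1: cancel each unlink against the pending create it undoes. */
        for (i = 0; i < nr; i++) {
                if (ops[i].wo_kind != WBC_OP_UNLINK)
                        continue;
                for (j = i; j-- > 0; ) {
                        if (ops[j].wo_kind == WBC_OP_CREATE &&
                            ops[j].wo_dead == 0 &&
                            strcmp(ops[j].wo_name, ops[i].wo_name) == 0) {
                                ops[j].wo_dead = 1;
                                ops[i].wo_dead = 1;
                                break;
                        }
                }
        }

        /* Pass 2: drop cancelled operations and fold the time updates. */
        for (i = 0; i < nr; i++) {
                if (ops[i].wo_dead != 0 || ops[i].wo_kind == WBC_OP_SETTIME) {
                        if (need_settime == 0 || ops[i].wo_time > latest)
                                latest = ops[i].wo_time;
                        need_settime = 1;
                        continue;
                }
                ops[out++] = ops[i];
        }

        /* At least one slot was freed above, so ops[out] is in bounds. */
        if (need_settime != 0) {
                ops[out].wo_kind = WBC_OP_SETTIME;
                ops[out].wo_name = NULL;
                ops[out].wo_time = latest;
                ops[out].wo_dead = 0;
                out++;
        }

        return out;
}
```
<br />
For a directory in which every created file is also removed, the compacted log holds a single time update regardless of the number of files, which is the O(1) batch size named in the response measure. The backward search makes this sketch quadratic in the log length; an index by name would be needed to do the cancellation efficiently.<br />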
<br />
; '''data-consistency'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || client executes data and meta-data operations on existing files, when conflicting lock on some data is requested by other client<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || maintain desired level of visible file system consistency<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
| '''Stimulus:'''|| conflicting access to the data range<br />
|-align="left"<br />
|'''Stimulus source:'''|| other client<br />
|-align="left"<br />
|'''Environment:'''|| file/object data and associated meta-data.<br />
|-align="left"<br />
|'''Artifact:'''|| client cached data and meta-data<br />
|-align="left"<br />
|'''Response:'''|| flush of the data and some meta-data. Details are similar to the '''sub-tree-conflict''' case, ''mutatis mutandis''.<br />
|-align="left"<br />
|'''Response measure:'''|| size of resulting batch(es) should be O(cached_state(conflicting-extent)) + O(1), where O(1) is for meta-data update.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| what consistency between data and meta-data do we want?<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| see '''questions'''.<br />
|-<br />
|}<br />
<br />
; '''unlink'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || client removes a number of (not hard-linked) files, and sends batched update to the server. On successful execution of a batch, client sends DESTROY rpcs to osts.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || client originated unlink<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability, performance, scalability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
| '''Stimulus:'''|| batched flush of MD operations<br />
|-align="left"<br />
|'''Stimulus source:'''|| MD cache<br />
|-align="left"<br />
|'''Environment:'''|| cached unlink operation<br />
|-align="left"<br />
|'''Artifact:'''|| flush and reintegration<br />
|-align="left"<br />
|'''Response:'''|| reintegration on the MD server, and file body destruction on the OSS servers. Several implementation strategies are possible. The simplest is to emulate the current "client-originated DESTROY" design, where the client sends an UNLINK rpc to the MD server and receives an "unlink cookie" (stored by the server in a transactional persistent log) that is then broadcast in parallel to all OST servers involved, together with the DESTROY rpc. In the case of batched updates this design is complicated, as a single batch can contain multiple unlinks.<br />
<br />
To achieve an even higher degree of concurrency, a range of cookies can be leased to the client (in the same vein as a range of fid sequences is). In that design the client sends rpcs to the MD and OS servers in parallel. Every server stores the received cookie in the persistent log transactionally with performing the operation in question. Then every OSS contacts the coordinator (which is the corresponding MDS) and reports operation completion, allowing the coordinator to cancel the llog entry. An additional failure mode introduced by this design is that at least one OSS may have received and carried out the operation while the RPC sent to the coordinator was either lost or had not yet arrived when the OSS reported operation completion.<br />
|-align="left"<br />
|'''Response measure:'''|| rpc concurrency level. Client sends DESTROY rpcs in parallel.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| some form of distributed transaction commit is probably the cleanest way to implement this.<br />
|-<br />
|}<br />
<br />
; '''recovery'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || client performs a number of MD operations. (A) Sends batch to the server. (B) Server executes batch. (C) Client gets reply. (D) Server commits batch. (E) Client gets commit notification. (F). Server crashes at either A, B, C, D, E, or F.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || service availability in the presence of server failures.<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
| '''Stimulus:'''|| transient server or network failure<br />
|-align="left"<br />
|'''Stimulus source:'''|| act of god<br />
|-align="left"<br />
|'''Environment:'''|| client and server connected by network, all faulty.<br />
|-align="left"<br />
|'''Artifact:'''|| recovery mechanism<br />
|-align="left"<br />
|'''Response:'''|| client recovery<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery completion. Usual recovery guarantees apply. Client has to keep up to O(cached_state) of state for replay.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| in what form is the batch kept in memory for recovery purposes?<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| see '''questions'''.<br />
|-<br />
|}<br />
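<br />
What the client does with a batch after a server restart depends on how far the batch had progressed. The sketch below assumes the usual split between resending requests that were never answered and replaying requests that were answered but not yet committed; ''struct batch_state'', ''enum batch_recovery'' and ''batch_recovery_action()'' are hypothetical names used only for illustration.<br />
<br />
```c
#include <stdbool.h>

/* Hypothetical record a client keeps for each batch it has sent. */
struct batch_state {
        bool bs_replied;        /* a reply was received from the server */
        bool bs_committed;      /* a commit notification was received */
};

enum batch_recovery {
        BATCH_RESEND,           /* no reply seen: send the request again */
        BATCH_REPLAY,           /* replied but not committed: replay it */
        BATCH_DISCARD           /* committed: the client copy can be freed */
};

/* Decide what to do with one batch once the server is back. */
enum batch_recovery batch_recovery_action(const struct batch_state *bs)
{
        if (bs->bs_committed)
                return BATCH_DISCARD;
        if (bs->bs_replied)
                return BATCH_REPLAY;
        return BATCH_RESEND;
}
```
<br />
Only batches in the last state can be dropped, which is why the client has to keep up to O(cached_state) of state for replay.<br />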
<br />
; '''dependency'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || client performs an MD operation involving more than one object (link, unlink, etc.). A lock protecting one of the objects involved is revoked.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || maintain desired level of visible file system consistency<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
| '''Stimulus:'''|| revocation of lock on one of the dependent objects.<br />
|-align="left"<br />
|'''Stimulus source:'''|| other client<br />
|-align="left"<br />
|'''Environment:'''|| dependent state in MD cache<br />
|-align="left"<br />
|'''Artifact:'''|| batched cache flush<br />
|-align="left"<br />
|'''Response:'''|| flush of dependent state<br />
|-align="left"<br />
|'''Response measure:'''|| amount of state flushed. In the worst case flush to all MD servers might be needed. All rpcs can be sent in parallel.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| how to track dependencies?<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| see '''questions'''.<br />
|-<br />
|}<br />
<br />
; '''rename'''<br />
<br />
Special case of '''dependency''' in which dependency is bi-directional: both parent directories depend on each other.<br />
<br />
; '''CMD-rename'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || client renames a file across directories located on different MD servers. A lock protecting one of these directories is revoked.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || maintain desired level of visible file system consistency<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability, scalability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
| '''Stimulus:'''|| revocation of lock on one of the inter-dependent objects.<br />
|-align="left"<br />
|'''Stimulus source:'''|| other client<br />
|-align="left"<br />
|'''Environment:'''|| dependent state in MD cache<br />
|-align="left"<br />
|'''Artifact:'''|| batched cache flush<br />
|-align="left"<br />
|'''Response:'''|| flush of dependent state to both MD servers.<br />
|-align="left"<br />
|'''Response measure:'''|| amount of state flushed. Updates to both parent directories have to be flushed in parallel.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| how to coordinate reintegration on multiple servers?<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| see '''questions'''.<br />
|-<br />
|}<br />
<br />
== Lower level choices ==<br />
<br />
{|cellspacing="0" border="1"<br />
!Description<br />
!Quality<br />
!Semantics<br />
|-<br />
|keep-what||performance, semantics||does client keep cached operations as a log or as an accumulated state or hybrid thereof<br />
|-<br />
|send-what||performance||how cached operations are transferred over network: as a log or as a bulk state update<br />
|-<br />
|cache-flush||usability||what triggers cache flush<br />
|-<br />
|consistency||usability, semantics||what consistency guarantees WBC provides w.r.t. meta-data visibility by other clients<br />
|-<br />
|data-consistency||usability, semantics||what consistency guarantees WBC provides w.r.t. ordering between data and meta-data operations<br />
|-<br />
|lock-conflicts||usability, scalability||how lock conflicts are handled<br />
|-<br />
|stop-the-world||usability||what form of concurrency control is used to achieve consistency and scalability during batched send<br />
|-<br />
|recovery||correctness||how server failure is handled<br />
|-<br />
|client-originated-ops||scalability||whether client-originated unlink is implementable<br />
|}<br />
<br />
<br />
<br />
== Issues ==<br />
<br />
It seems that scalability favors at least sending MD operations in the form of a bulk state update, while data-consistency and stop-the-world are easier to achieve with a log-based representation.<br />
<br />
Clustered meta-data: suppose that in a CMD setup a client renames a file, moving its name from one server to another. The '''correctness''' requirement in this case means that either both servers reintegrate the changes or neither does, which (it seems) implies a CMD roll-back, originated and controlled by the client.<br />
<br />
Cross-mds MD dependencies introduce the danger of cascading evictions (much like cross-ost locks do).<br />
<br />
Cross-mds operations together with batching require the mdt to be able to coordinate a distributed operation ''from any point'', e.g., the case has to be handled where a cross-ref unlink rpc arrives at either the server holding the directory or the server holding the object; similarly for rename, etc. It seems logical that in the first version cross-ref operations (deemed to be rare) are not cached, so as to avoid server modifications.<br />
<br />
== Effort decomposition ==<br />
<br />
The following table also includes (a non-exhaustive list of) the sub-components of [[Architecture - Epochs|Epochs]] and [[Architecture - Sub Tree Locks|Sub Tree Locks]].<br />
<br />
C-* tasks are for the client, S-* tasks are for the server. Dependencies marked with (*) are ''weak''.<br />
<br />
{| cellpadding="2" cellspacing="0" border="1"<br />
!Component<br />
!Sub-component<br />
!scope<br />
!depends upon<br />
|-<br />
|WBC<br />
|C-VFS-MM<br />
|integration with vfs: inodes, dentries, memory pressure. Executing operation effects locally.<br />
|<br />
|-<br />
|<br />
|C-ops-caching<br />
|tracking operations: list vs. fragments. Tracking dependencies.<br />
|<br />
|-<br />
|<br />
|C-write-out<br />
|policy deciding when to write-out cached state updates, and with what granularity: age, amount, max-in-flight<br />
|C-grants<br />
|-<br />
|<br />
|C-dir-pages<br />
|caching of directory pages and using them for local lookups<br />
|C-VFS-MM<br />
|-<br />
|<br />
|C-new-files<br />
|creation of new files locally<br />
|C-VFS-MM<br />
|-<br />
|<br />
|C-new-objects<br />
|creation of new objects locally<br />
|S-ost-fid C-VFS-MM<br />
|-<br />
|<br />
|C-DLM<br />
|invoking reintegration on a lock cancel, lock weighting<br />
|C-ops-caching<br />
|-<br />
|<br />
|C-data<br />
|dependencies between cached data and meta-data<br />
|C-ops-caching<br />
|-<br />
|<br />
|C-IO<br />
|switching between whole-file mds-based locking and extent locking<br />
|<br />
|-<br />
|<br />
|C-grants<br />
|unified resource range leasing mechanism<br />
|S-grants<br />
|-<br />
|<br />
|S-grants<br />
|unified resource range leasing mechanism<br />
|S-ost-fid<br />
|-<br />
|<br />
|C-misc<br />
|sync, fsync, compatibility flag, mount option<br />
|C-ops-caching<br />
|-<br />
|<br />
|S-misc<br />
|compatibility flags<br />
|<br />
|-<br />
|STL<br />
|C-policy<br />
|track usage statistics and use them to decide when to ask for an STL<br />
|<br />
|-<br />
|<br />
|S-policy<br />
|track usage statistics and use them to decide when to grant an STL<br />
|<br />
|-<br />
|EPOCHS<br />
|formalization<br />
|formal reintegration model with "proofs" of recovery correctness and concurrency control description<br />
|<br />
|-<br />
|<br />
|C-reintegration<br />
|reintegration, including concurrency control, integration with ptlrpc<br />
|S-compound S-reintegration<br />
|-<br />
|<br />
|S-compound<br />
|implementation of the compound operations on the server<br />
|<br />
|-<br />
|<br />
|S-reintegration<br />
|reintegration of batches on the server, thread scheduling<br />
|<br />
|-<br />
|<br />
|S-undo<br />
|keeping undo logs<br />
|S-gc(*)<br />
|-<br />
|<br />
|S-cuts<br />
|implementation of the CUTs algorithm<br />
|<br />
|-<br />
|<br />
|C-gc<br />
|garbage collection: when to discard cached batches<br />
|<br />
|-<br />
|<br />
|S-gc<br />
|garbage collection: when to discard undo logs<br />
|<br />
|-<br />
|<br />
|C-recovery<br />
|replay, including optional optimistic "pre-replay"<br />
|<br />
|-<br />
|<br />
|S-recovery-0<br />
|roll-back of the uncommitted epochs<br />
|S-gc<br />
|-<br />
|<br />
|S-recovery-1<br />
|roll-forward from the clients<br />
|C-gc<br />
|-<br />
|EXTERNAL<br />
|S-ost-fid<br />
|ost understanding fids, and granting fid sequences to the clients<br />
|<br />
|}<br />
<br />
== References ==<br />
<br />
[https://bugzilla.lustre.org/show_bug.cgi?id=14170 bug 14170]<br />
<br />
[[Architecture - Epochs|Epochs]]</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Submitting_Patches&diff=11802Submitting Patches2010-07-16T22:15:45Z<p>Adilger: /* Submitting Patches for Review */ add format-patch example</p>
<hr />
<div><small>''(Updated: Dec 2009)''</small><br />
<br />
'''''NOTICE:''''' A transition from CVS to Git took place on Monday, December 14. For more information about the transition, see the [[Git Transition Notice]]. For details about how to migrate to Git, see [[Migrating to Git]].<br />
<br />
----<br />
<br />
When you are ready to have your patch reviewed, follow the process described below for submitting it using Bugzilla. <br />
<br />
'''''Note:''''' It is sometimes desirable to solicit reviews of a patch on the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to expose the patch to a wider audience. However, this will ''NOT'' put the patch on track to being accepted into the Lustre™ repository.<br />
<br />
=== Submitting Patches for Review ===<br />
<br />
To have your changes accepted into a mainline Lustre branch, your code must be reviewed and approved by senior Lustre engineers. Following these steps will speed up review of your changes and increase the likelihood of success:<br />
<br />
1. Read, complete, and return the form found at [[Media:Sun_Contributor_Agreement_1_5.pdf|Contributor Agreement]]. We cannot accept your contributions without this form. See [[Contribution_Policy|Contribution Policy]] for more information.<br />
<br />
2. Testing the patch is required before it can be submitted. The patch must include any new tests specific to the bug/feature. See [[Testing Lustre Code]] for specific details. <br />
<br />
3. Generate a patch with ''diff -upN'', ''git diff'', or ''git format-patch''. Please do not send other kinds of patches unless your reviewer requests them.<br />
<br />
The easiest command for generating a patch is:<br />
<pre><br />
[lustre]$ git diff {basebranch} > {patchname}.diff<br />
</pre><br />
where ''{basebranch}'' is the branch you are patching against (''b1_6'', ''b1_8'', or ''master''). Note that this patch will include both committed and uncommitted changes on your branch. If you have well-defined patches with proper commit comments as described below, it is also possible to use ''git format-patch'':<br />
<pre><br />
[lustre]$ git format-patch {since}<br />
e.g.<br />
[lustre]$ git format-patch -2 # format the last 2 commits<br />
</pre><br />
<br />
If you are unfamiliar with this process, use ''git diff''.<br />
<br />
If sending changes with ''git format-patch'' we ask that you follow our standard commit message format when making your commits, so that the patch can more easily be identified in the future. If you are doing a rebase, you will get a chance to modify/combine your commit messages. Commit messages for final patches should look like this:<br />
<pre><br />
b=<bugno> <One-line summary of change><br />
<br />
<Full description of change><br />
<br />
i=<inspector1><br />
i=<inspector2><br />
</pre><br />
<br />
If you are not using git format-patch, then simply adding the above lines at the start of the submission email is enough. If you are making a commit prior to submitting the patch for inspection, simply omit the ''i=<inspector>'' lines, and then use ''git commit --amend'' to change the commit comment after inspection is complete.<br />
<br />
4. Find or file a bug corresponding to your contribution in [http://bugzilla.lustre.org/ Bugzilla]. For more information about Bugzilla, see the [[Developers Guide to Bugzilla for Lustre|Developers Guide to Bugzilla]], the [https://bugzilla.lustre.org/page.cgi?id=bug-writing.html Bugzilla - Bug Writing Guidelines], or the [https://bugzilla.lustre.org/docs/html/using.html Bugzilla User Guide].<br />
<br />
* Provide the patch as an Attachment (click on "Add an Attachment")<br />
* Select the "patch" box.<br />
** If submitting a new bug with a patch attached, follow normal bug submission procedures. The support team will assign the bug and inspections as appropriate.<br />
** If working with a Lustre internal engineer, under "Flags" set the ''inspection'' flag to "?" and copy the email address of the engineer into the adjacent ''Requestee:'' field.<br />
** If you have completed testing of the patch, set the "acc-sm_passed_''release'' +" flag for the branch(es) that passed testing. If you have not actually run the acceptance-small.sh test script to completion (unless advised otherwise) you should describe the testing performed to date, and can optionally set the "more-testing_''release'' +" flag.<br />
** If you have not been collaborating with someone on the Lustre team and don't know who should review your work, assign the inspection to ''lustre-rmg-team@sun.com''<br />
* Click on "commit" to submit the attachment and inspection request.<br />
<br />
5. One or more reviewers will submit comments regarding your patch. Iterate the patch until you receive inspection approval, have passed all requested testing, or the bug is closed.<br />
<br />
6. Request patch landing permission by setting the "landing_''release'' ?" flag for your patch.<br />
<br />
7. Once you have landing approval (as given by the branch maintainer in the form of a "landing_''release'' +" flag on the patch), mail the patch to [mailto:lustre-gate-20@sun.com lustre-gate-20] for Lustre 2.0, or [mailto:lustre-gate-18@sun.com lustre-gate-18] for Lustre 1.8. Include the bug number and reviewer in the commit message along with a concise description of the change, as stated above.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Submitting_Patches&diff=11801Submitting Patches2010-07-16T22:09:05Z<p>Adilger: /* Submitting Patches for Review */ move testing flags setting</p>
<hr />
<div><small>''(Updated: Dec 2009)''</small><br />
<br />
'''''NOTICE:''''' A transition from CVS to Git took place on Monday, December 14. For more information about the transition, see the [[Git Transition Notice]]. For details about how to migrate to Git, see [[Migrating to Git]].<br />
<br />
----<br />
<br />
When you are ready to have your patch reviewed, follow the process described below for submitting it using Bugzilla. <br />
<br />
'''''Note:''''' It is sometimes desirable to solicit reviews of a patch on the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to expose the patch to a wider audience. However, this will ''NOT'' put the patch on track to being accepted into the Lustre™ repository.<br />
<br />
=== Submitting Patches for Review ===<br />
<br />
To have your changes accepted into a mainline Lustre branch, your code must be reviewed and approved by senior Lustre engineers. Following these steps will speed up review of your changes and increase the likelihood of success:<br />
<br />
1. Read, complete, and return the form found at [[Media:Sun_Contributor_Agreement_1_5.pdf|Contributor Agreement]]. We cannot accept your contributions without this form. See [[Contribution_Policy|Contribution Policy]] for more information.<br />
<br />
2. Testing the patch is required before it can be submitted. The patch must include any new tests specific to the bug/feature. See [[Testing Lustre Code]] for specific details. <br />
<br />
3. Generate a patch with ''diff -upN'', ''git diff'', or ''git format-patch''. Please do not send other kinds of patches unless your reviewer requests them.<br />
<br />
The command for generating a patch is:<pre><br />
[lustre]$ git diff {basebranch} > {patchname}.diff<br />
</pre><br />
where ''{basebranch}'' is the branch you are patching against (''b1_6'', ''b1_8'', or ''master''). Note this patch will include committed and uncommitted changes on your branch. If you have nicely squashed your commit history, feel free to use ''git format-patch''. If you are unfamiliar with this process, use ''git diff''.<br />
<br />
If sending changes with ''git format-patch'' we ask that you follow the standard commit message format when making your commits, so that the patch can more easily be identified in the future. If you are doing a rebase, you will get a chance to modify/combine your commit messages. Commit messages for final patches should look like this:<br />
<pre><br />
b=<bugno> <Single line summary of change><br />
<br />
<In depth description><br />
<br />
i=<inspector1><br />
i=<inspector2><br />
</pre><br />
<br />
If you are not using git format-patch, then simply adding the above lines at the start of the submission email is enough.<br />
<br />
4. Find or file a bug corresponding to your contribution in [http://bugzilla.lustre.org/ Bugzilla]. For more information about Bugzilla, see the [[Developers Guide to Bugzilla for Lustre|Developers Guide to Bugzilla]], the [https://bugzilla.lustre.org/page.cgi?id=bug-writing.html Bugzilla - Bug Writing Guidelines], or the [https://bugzilla.lustre.org/docs/html/using.html Bugzilla User Guide].<br />
<br />
* Provide the patch as an Attachment (click on "Add an Attachment")<br />
* Select the "patch" box.<br />
** If submitting a new bug with a patch attached, follow normal bug submission procedures. The support team will assign the bug and inspections as appropriate.<br />
** If working with a Lustre internal engineer, under "Flags" set the ''inspection'' flag to "?" and copy the email address of the engineer into the adjacent ''Requestee:'' field.<br />
** If you have completed testing of the patch, set the "acc-sm_passed_''release'' +" flag for the branch(es) that passed testing. If you have not actually run the acceptance-small.sh test script to completion (unless advised otherwise) you should describe the testing performed to date, and can optionally set the "more-testing_''release'' +" flag.<br />
** If you have not been collaborating with someone on the Lustre team and don't know who should review your work, assign the inspection to ''lustre-rmg-team@sun.com''<br />
* Click on "commit" to submit the attachment and inspection request.<br />
<br />
5. One or more reviewers will submit comments regarding your patch. Iterate the patch until you receive inspection approval, have passed all requested testing, or the bug is closed.<br />
<br />
6. Request patch landing permission by setting the "landing_''release'' ?" flag for your patch.<br />
<br />
7. Once you have landing approval (as given by the branch maintainer in the form of a "landing_''release'' +" flag on the patch), mail the patch to [mailto:lustre-gate-20@sun.com lustre-gate-20] for Lustre 2.0, or [mailto:lustre-gate-18@sun.com lustre-gate-18] for Lustre 1.8. Include the bug number and reviewer in the commit message along with a concise description of the change, as stated above.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=11800Coding Guidelines2010-07-15T07:42:44Z<p>Adilger: /* Lustre Guidelines */</p>
<hr />
<div><small>''(Updated: Jan 2010)''</small><br />
== Beautiful Code == <br />
<br />
''A note from Eric Barton, a Lustre pioneer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''rq_reqmsg'' and ''rq_repmsg'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
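<br />
The points above on assertions, declarations and explicit comparisons can be illustrated with a short sketch. It is not taken from the Lustre tree: ''struct demo_msg'', ''DEMO_MSG_MAGIC'' and ''demo_msg_check()'' are hypothetical names, and the standard ''assert()'' stands in for whatever assertion macro the surrounding code base uses.<br />
<br />
```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MSG_MAGIC 0x0BD00BD0u      /* hypothetical wire magic */

/* Hypothetical message header as it arrives from the network. */
struct demo_msg {
        uint32_t dm_magic;              /* must be DEMO_MSG_MAGIC */
        uint32_t dm_len;                /* payload length claimed by the sender */
};

/*
 * Check a header that was received in a buffer of \a buflen bytes.
 * Returns 0 if the header is usable, -EINVAL otherwise.
 */
int demo_msg_check(const struct demo_msg *msg, size_t buflen)
{
        size_t payload_max;

        /* Invariant of our own code: callers never pass NULL. */
        assert(msg != NULL);

        /* Untrusted input: report an error instead of asserting. */
        if (buflen < sizeof(*msg))
                return -EINVAL;

        payload_max = buflen - sizeof(*msg);

        if (msg->dm_magic != DEMO_MSG_MAGIC)
                return -EINVAL;
        if (msg->dm_len > payload_max)
                return -EINVAL;

        return 0;
}
```
<br />
The assertion documents an assumption about the caller, while everything that originates on the wire is checked and rejected with an error code, so a corrupt or malicious message cannot crash the node.<br />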
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
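<br />
:To make the point about the clever swap concrete, here is a small sketch. The function names are invented for illustration and are not taken from the Lustre tree. Both swaps behave identically on two distinct variables, but the XOR version quietly destroys the value when both pointers refer to the same object, which is exactly the kind of secret a reader should not have to know about.<br />

```c
/* Plain swap through a temporary: correct even when a and b alias. */
static void swap_plain(int *a, int *b)
{
        int t = *a;

        *a = *b;
        *b = t;
}

/* XOR swap: no temporary, but if a == b the first step computes
 * *a ^ *a, so the value is zeroed and never recovered. */
static void swap_xor(int *a, int *b)
{
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
}
```

:Calling ''swap_xor(&x, &x)'' leaves ''x'' at zero whatever it held before, while ''swap_plain(&x, &x)'' leaves it untouched.<br />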
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parentheses, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* No tabs should be used in any Lustre™, LNET or ''libcfs'' files. The exceptions are ''libsysio'' (maintained by someone else), ''ldiskfs'' and kernel patches (also part of a non-Lustre Group project).<br />
<br />
* Blocks should be indented 8 spaces.<br />
<br />
* New files should contain the following along with the license boilerplate. This will cause vim and emacs to use spaces instead of tabs for indenting. If you use a different editor, it also needs to be set to use spaces for indenting Lustre code.<br />
<pre><br />
/* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*-<br />
* vim:expandtab:shiftwidth=8:tabstop=8:<br />
*/<br />
</pre><br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
        do_sth2_1;<br />
<br />
        if (cond3)<br />
                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
        do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
        if (!cond1)<br />
                return;<br />
<br />
        do_sth1_1;<br />
<br />
        if (cond2)<br />
                func_helper(...);<br />
<br />
        do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
        if (cond1) {<br />
                do_sth1_1;<br />
                if (cond2) {<br />
                        do_sth2_1;<br />
                        if (cond3) {<br />
                                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
                        }<br />
                        do_sth2_2;<br />
                }<br />
                do_sth1_2;<br />
        }<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regexps:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
ptr = malloc(size);<br />
if (ptr != NULL) {<br />
...<br />
<br />
wrong:<br />
if ((ptr = malloc(size)) != NULL) {<br />
...<br />
</pre><br />
<br />
* Conditional expressions read more clearly if only boolean expressions are tested implicitly; non-boolean and pointer expressions should be compared explicitly with ''0'' and ''NULL'' respectively.<br />
<pre><br />
right:<br />
if (!writing && /* not writing? */<br />
inode != NULL && /* valid inode? */<br />
ref_count == 0) /* no more references? */<br />
do_this();<br />
<br />
wrong:<br />
if (writing == 0 && /* not writing? */<br />
inode && /* valid inode? */<br />
!ref_count) /* no more references? */<br />
do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
if (a->a_field == 3 ||<br />
    ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
        do_this();<br />
<br />
wrong:<br />
if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
        do_this();<br />
<br />
wrong:<br />
if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
    (c->c_field & (BITMASK2)))))<br />
        do_this();<br />
</pre><br />
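<br />
:A classic precedence slip that a pair of parentheses prevents is mixing ''&'' with ''==''. The sketch below uses a made-up flag name (''DEMO_FLAG_DIRTY'' is an assumption, not a Lustre constant) and spells out how the compiler actually groups the unparenthesized form.<br />

```c
#define DEMO_FLAG_DIRTY 0x4     /* hypothetical flag bit, illustration only */

/* Intended test: "the DIRTY bit is clear". */
static int dirty_bit_clear(unsigned int flags)
{
        return (flags & DEMO_FLAG_DIRTY) == 0;
}

/* How "flags & DEMO_FLAG_DIRTY == 0" is really parsed: == binds
 * tighter than &, so the mask is compared with 0 first (giving 0)
 * and that result is then ANDed with flags. */
static int dirty_bit_clear_as_parsed(unsigned int flags)
{
        return flags & (DEMO_FLAG_DIRTY == 0);
}
```

:With ''flags'' set to ''0x1'' the first function reports that the bit is clear, while the second always returns 0 regardless of the input.<br />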
<br />
=== Lustre Guidelines ===<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each'' followed by ''list_entry''<br />
<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
int *array;<br />
<br />
OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(int)); /* breaks if array becomes __u64 */<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
<br />
</pre><br />
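<br />
:The three ''sizeof()'' forms can be compared outside the kernel with plain C. This is only a sketch: ''DEMO_COUNT'' and the function names are invented, and ''uint64_t'' stands in for ''__u64'' after the hypothetical type change mentioned above.<br />

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_COUNT 10

/* right: the size follows the element type of "array" automatically */
static size_t bytes_from_variable(void)
{
        uint64_t *array = NULL;

        return DEMO_COUNT * sizeof(*array);
}

/* wrong: the type was spelled out and not updated when array changed */
static size_t bytes_from_stale_type(void)
{
        return DEMO_COUNT * sizeof(int);
}

/* wrong: this is the size of the pointer, not of what it points to */
static size_t bytes_from_pointer(void)
{
        uint64_t *array = NULL;

        return DEMO_COUNT * sizeof(array);
}
```

:Only the first form keeps tracking the element size if the declaration of ''array'' changes again.<br />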
<br />
* When allocating/freeing a single struct, use OBD_ALLOC_PTR() for clarity:<br />
<pre><br />
right:<br />
OBD_ALLOC_PTR(mds_body);<br />
OBD_FREE_PTR(mds_body);<br />
<br />
wrong:<br />
OBD_ALLOC(mds_body, sizeof(*mds_body));<br />
OBD_FREE(mds_body, sizeof(*mds_body));<br />
</pre><br />
<br />
* Do not embed operations inside assertions. If assertions are disabled for performance reasons this code will not be executed.<br />
<pre><br />
right:<br />
ptr = strcat(foo, bar);<br />
LASSERT(ptr != NULL);<br />
<br />
wrong:<br />
LASSERT(strcat(foo, bar) != NULL);<br />
</pre><br />
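<br />
:The effect of compiling assertions out can be simulated in a few lines. The macros below are hypothetical stand-ins and are not Lustre's ''LASSERT()''; ''DEMO_ASSERT_DISABLED'' models a build in which the assertion expands to nothing.<br />

```c
#include <stdlib.h>

/* Hypothetical assertion macro as built with checks enabled. */
#define DEMO_ASSERT_ENABLED(cond)               \
do {                                            \
        if (!(cond))                            \
                abort();                        \
} while (0)

/* The same macro as built with checks compiled out: the condition
 * is never evaluated. */
#define DEMO_ASSERT_DISABLED(cond) ((void)0)

/* wrong: the increment lives inside the assertion and vanishes with it */
static int take_ref_wrong(int *refcount)
{
        DEMO_ASSERT_DISABLED(++(*refcount) > 0);
        return *refcount;
}

/* right: do the work first, then assert on the result */
static int take_ref_right(int *refcount)
{
        ++(*refcount);
        DEMO_ASSERT_DISABLED(*refcount > 0);
        return *refcount;
}
```

:Starting from a count of 0, ''take_ref_wrong()'' returns 0 because the increment was discarded along with the check, while ''take_ref_right()'' returns 1.<br />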
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
list_for_each_entry(...) {<br />
<br />
if (!condition1) {<br />
do_sth1;<br />
continue;<br />
}<br />
<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
<br />
if (!condition2)<br />
break;<br />
<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
}<br />
wrong:<br />
list_for_each_entry(...) {<br />
if (condition1) {<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
if (condition2) {<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
continue;<br />
} <br />
break;<br />
} else {<br />
do_sth1;<br />
}<br />
}<br />
</pre><br />
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column, preferably with longer declarations at the top.<br />
<pre><br />
right:<br />
int           len;<br />
int           count;<br />
struct inode *inode;<br />
<br />
wrong:<br />
int len, count;<br />
struct inode *inode;<br />
</pre><br />
<br />
* Variable declarations should be kept to an internal scope, if practical and reasonable, to simplify understanding of where these variables are used:<br />
<br />
<pre><br />
right:<br />
int len;<br />
<br />
if (len > 0) {<br />
int count;<br />
struct inode *inode = iget(foo);<br />
<br />
count = inode->i_size;<br />
:<br />
}<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
right:<br />
if (foo)<br />
bar();<br />
wrong:<br />
if (foo) bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
                                    longest_argument(sub_argument,<br />
                                                     foo_argument),<br />
                                    last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
    another_long_condition(very_long_argument_name,<br />
                           another_long_argument_name) ><br />
    second_long_value) {<br />
        ...<br />
<br />
wrong:<br />
<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
        longest_argument(sub_argument, foo_argument),<br />
        last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
        another_long_condition(very_long_argument_name,<br />
        another_long_argument_name) ><br />
        second_long_value) {<br />
        ...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
off = le32_to_cpu(fsd->fsd_client_start) +<br />
        cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
right:<br />
a++;<br />
b |= c;<br />
d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right: <br />
do_foo(bar, baz);<br />
<br />
wrong:<br />
do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
right:<br />
for (a = 0; a < b; a++)<br />
if (a < b || a == c)<br />
while (1)<br />
wrong:<br />
for( a=0; a<b; a++ )<br />
if( a<b || a==c )<br />
while( 1 )<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
if (bar) {<br />
this();<br />
that();<br />
} else if (baz) {<br />
stuff();<br />
} else {<br />
other_stuff();<br />
}<br />
<br />
do {<br />
cow();<br />
} while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else {<br />
salmon();<br />
}<br />
<br />
wrong:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else<br />
moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a)                             \<br />
do {                                            \<br />
        int b = (a) + MAGIC;                    \<br />
        do_other_stuff(b);                      \<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a)                             \<br />
do {                                            \<br />
        int b = a + MAGIC;                      \<br />
        do_other_stuff(b);                      \<br />
} while (0);<br />
</pre><br />
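<br />
:Both halves of this rule can be checked with a small standalone sketch. All the macro and function names here are invented for illustration. The first pair shows what an unprotected parameter does to an expression argument; the last macro shows a multi-statement body used safely as the single statement of an unbraced ''if''/''else''.<br />

```c
/* Hypothetical macros, illustration only. */
#define DEMO_TWICE_WRONG(a)     (a * 2)         /* parameter not protected */
#define DEMO_TWICE_RIGHT(a)     ((a) * 2)

static int twice_wrong(int x, int y)
{
        return DEMO_TWICE_WRONG(x + y);         /* expands to (x + y * 2) */
}

static int twice_right(int x, int y)
{
        return DEMO_TWICE_RIGHT(x + y);         /* expands to ((x + y) * 2) */
}

/* do/while (0) with no trailing semicolon: the caller's own ';'
 * completes the statement, so the else below still pairs with the if. */
#define DEMO_ADD_TWICE(total, a)                \
do {                                            \
        int demo_tmp = (a);                     \
        (total) += demo_tmp;                    \
        (total) += demo_tmp;                    \
} while (0)

static int add_twice_if(int enabled, int value)
{
        int total = 0;

        if (enabled)
                DEMO_ADD_TWICE(total, value);
        else
                total = -1;

        return total;
}
```

:Had ''DEMO_ADD_TWICE'' ended in ''while (0);'', the extra semicolon would terminate the ''if'' statement and the ''else'' would no longer compile.<br />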
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
if (invalid_dentry(parent)) {<br />
...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
if (d_unhashed(parent)) {<br />
#endif<br />
...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
of<br />
stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
* stuff */<br />
# endif /* HAVE_FEATURE */<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
* as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64                 LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t                LPSZ (or cast to int and use %u / %d)<br />
__u32/int             %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long  %llu/%llx/%lld<br />
loff_t                %lld after a cast to long long (unfortunately)<br />
</pre><br />
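<br />
:The "cast to long long" rows can be tried in userspace with standard C only. This sketch does not use the Lustre ''LPU64''-style macros; ''uint64_t'' and ''int64_t'' are assumptions standing in for ''__u64'' and ''loff_t'', and the helper names are invented.<br />

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 64-bit unsigned value by casting to the exact type that
 * the %llu conversion expects. */
static int format_u64(char *buf, size_t len, uint64_t val)
{
        return snprintf(buf, len, "%llu", (unsigned long long)val);
}

/* Format a signed file offset the same way; int64_t stands in for
 * loff_t here. */
static int format_offset(char *buf, size_t len, int64_t off)
{
        return snprintf(buf, len, "%lld", (long long)off);
}
```

:The explicit cast keeps the argument type and the conversion specifier in agreement on both 32- and 64-bit builds, whatever the underlying width of the typedef.<br />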
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
[ac_cv_emxos2=yes],<br />
[ac_cv_emxos2=no])])<br />
</pre><br />
:or even:<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
[ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
[return __EMX__;])],<br />
[ac_cv_emxos2=yes],<br />
[ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=11799Coding Guidelines2010-07-15T07:41:58Z<p>Adilger: /* Lustre Guidelines */ add OBD_ALLOC_PTR()</p>
<hr />
<div><small>''(Updated: Jan 2010)''</small><br />
== Beautiful Code == <br />
<br />
''A note from Eric Barton, a Lustre pioneer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''rq_reqmsg'' and ''rq_repmsg'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parentheses, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* No tabs should be used in any Lustre™, LNET or ''libcfs'' files. The exceptions are ''libsysio'' (maintained by someone else), ''ldiskfs'' and kernel patches (also part of a non-Lustre Group project).<br />
<br />
* Blocks should be indented 8 spaces.<br />
<br />
* New files should contain the following along with the license boilerplate. This will cause vim and emacs to use spaces instead of tabs for indenting. If you use a different editor, it also needs to be set to use spaces for indenting Lustre code.<br />
<pre><br />
/* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*-<br />
* vim:expandtab:shiftwidth=8:tabstop=8:<br />
*/<br />
</pre><br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
        do_sth2_1;<br />
<br />
        if (cond3)<br />
                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
        do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
        if (!cond1)<br />
                return;<br />
<br />
        do_sth1_1;<br />
<br />
        if (cond2)<br />
                func_helper(...);<br />
<br />
        do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
        if (cond1) {<br />
                do_sth1_1;<br />
                if (cond2) {<br />
                        do_sth2_1;<br />
                        if (cond3) {<br />
                                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
                        }<br />
                        do_sth2_2;<br />
                }<br />
                do_sth1_2;<br />
        }<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regexps:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
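<br />
:One concrete way a pointer ''typedef'' obscures the code is its interaction with ''const''. In the sketch below (all names are hypothetical), the first function looks as if it takes a read-only node, yet ''const'' applies to the hidden pointer rather than to the structure, so the write compiles without complaint.<br />

```c
struct demo_node {
        int dn_value;
};

/* discouraged: the '*' is hidden inside the typedef */
typedef struct demo_node *demo_node_ptr_t;

/* Looks read-only, but this parameter is really
 * "struct demo_node * const": only the pointer itself is const. */
static void set_via_typedef(const demo_node_ptr_t node, int value)
{
        node->dn_value = value;
}

/* With the '*' visible, the const-ness of the target is explicit. */
static int get_value(const struct demo_node *node)
{
        return node->dn_value;
}
```

:A reader skimming the first prototype has no way to tell that the structure may be modified without looking up the ''typedef''.<br />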
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
ptr = malloc(size);<br />
if (ptr != NULL) {<br />
...<br />
<br />
wrong:<br />
if ((ptr = malloc(size)) != NULL) {<br />
...<br />
</pre><br />
<br />
* Conditional expressions read more clearly if only boolean expressions are tested implicitly; non-boolean and pointer expressions should be compared explicitly with ''0'' and ''NULL'' respectively.<br />
<pre><br />
right:<br />
if (!writing && /* not writing? */<br />
inode != NULL && /* valid inode? */<br />
ref_count == 0) /* no more references? */<br />
do_this();<br />
<br />
wrong:<br />
if (writing == 0 && /* not writing? */<br />
inode && /* valid inode? */<br />
!ref_count) /* no more references? */<br />
do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
if (a->a_field == 3 ||<br />
    ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
        do_this();<br />
<br />
wrong:<br />
if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
        do_this();<br />
<br />
wrong:<br />
if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
    (c->c_field & (BITMASK2)))))<br />
        do_this();<br />
</pre><br />
<br />
=== Lustre Guidelines ===<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each'' followed by ''list_entry''<br />
<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
int *array;<br />
<br />
OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(int)); /* will break if array becomes __u64 */<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
<br />
</pre><br />
<br />
* When allocating/freeing a single struct, use OBD_ALLOC_PTR() for clarity:<br />
<pre><br />
right:<br />
OBD_ALLOC_PTR(mds_body);<br />
OBD_FREE_PTR(mds_body);<br />
<br />
wrong:<br />
OBD_ALLOC(mds_body, sizeof(*mds_body));<br />
OBD_FREE(mds_body, sizeof(*mds_body));<br />
</pre><br />
<br />
* Do not embed operations inside assertions. If assertions are disabled for performance reasons this code will not be executed.<br />
<pre><br />
right:<br />
ptr = strcat(foo, bar);<br />
LASSERT(ptr != NULL);<br />
<br />
wrong:<br />
LASSERT(strcat(foo, bar) != NULL);<br />
</pre><br />
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
list_for_each_entry(...) {<br />
<br />
if (!condition1) {<br />
do_sth1;<br />
continue;<br />
}<br />
<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
<br />
if (!condition2)<br />
break;<br />
<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
}<br />
wrong:<br />
list_for_each_entry(...) {<br />
if (condition1) {<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
if (condition2) {<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
continue;<br />
} <br />
break;<br />
} else {<br />
do_sth1;<br />
}<br />
}<br />
</pre><br />
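<br />
:The same shape applied to a concrete loop, as a sketch only (the function and its parameters are invented for illustration): the cheap rejections come first and leave the loop body early, so the main work stays at a single level of indentation.<br />

```c
#include <stddef.h>

/* Sum the positive entries, skipping non-positive ones and stopping
 * at the first entry larger than "limit". */
static int sum_until_limit(const int *vals, size_t count, int limit)
{
        size_t i;
        int    sum = 0;

        for (i = 0; i < count; i++) {
                if (vals[i] <= 0)
                        continue;

                if (vals[i] > limit)
                        break;

                sum += vals[i];
        }

        return sum;
}
```

:Writing this with nested ''if (vals[i] > 0) { if (vals[i] <= limit) { ... } else break; }'' blocks gives the same result but pushes the addition two levels deeper.<br />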
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column, preferably with longer declarations at the top.<br />
<pre><br />
right:<br />
int           len;<br />
int           count;<br />
struct inode *inode;<br />
<br />
wrong:<br />
int len, count;<br />
struct inode *inode;<br />
</pre><br />
<br />
* Variable declarations should be kept to an internal scope, if practical and reasonable, to simplify understanding of where these variables are used:<br />
<br />
<pre><br />
right:<br />
int len;<br />
<br />
if (len > 0) {<br />
int count;<br />
struct inode *inode = iget(foo);<br />
<br />
count = inode->i_size;<br />
:<br />
}<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
right:<br />
if (foo)<br />
bar();<br />
wrong:<br />
if (foo) bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
        variable = do_something_complicated(long_argument, longer_argument,<br />
                                            longest_argument(sub_argument,<br />
                                                             foo_argument),<br />
                                            last_argument);<br />
<br />
        if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
            another_long_condition(very_long_argument_name,<br />
                                   another_long_argument_name) ><br />
            second_long_value) {<br />
                ...<br />
<br />
wrong:<br />
<br />
        variable = do_something_complicated(long_argument, longer_argument,<br />
                longest_argument(sub_argument, foo_argument),<br />
                last_argument);<br />
<br />
        if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
                another_long_condition(very_long_argument_name,<br />
                        another_long_argument_name) ><br />
                second_long_value) {<br />
                ...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
        off = le32_to_cpu(fsd->fsd_client_start) +<br />
                cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
right:<br />
a++;<br />
b |= c;<br />
d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right: <br />
do_foo(bar, baz);<br />
<br />
wrong:<br />
do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
right:<br />
for (a = 0; a < b; a++)<br />
if (a < b || a == c)<br />
while (1)<br />
wrong:<br />
for( a=0; a<b; a++ )<br />
if( a<b || a==c )<br />
while( 1 )<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
if (bar) {<br />
this();<br />
that();<br />
} else if (baz) {<br />
stuff();<br />
} else {<br />
other_stuff();<br />
}<br />
<br />
do {<br />
cow();<br />
} while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else {<br />
salmon();<br />
}<br />
<br />
wrong:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else<br />
moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a)                     \<br />
do {                                    \<br />
        int b = (a) + MAGIC;            \<br />
        do_other_stuff(b);              \<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a) \<br />
do { \<br />
        int b = a + MAGIC; \<br />
        do_other_stuff(b); \<br />
} while (0);<br />
</pre><br />
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
if (invalid_dentry(parent)) {<br />
...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
if (d_unhashed(parent)) {<br />
#endif<br />
...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
of<br />
stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
* stuff */<br />
# endif /* HAVE_FEATURE */<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
* as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64                  LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t                 LPSZ (or cast to int and use %u / %d)<br />
__u32/int              %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long   %llu/%llx/%lld<br />
loff_t                 %lld after a cast to long long (unfortunately)<br />
</pre><br />
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
                   [ac_cv_emxos2=yes],<br />
                   [ac_cv_emxos2=no])])<br />
</pre><br />
:or even<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
               [ac_cv_emxos2],<br />
               [AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
                                                   [return __EMX__;])],<br />
                                  [ac_cv_emxos2=yes],<br />
                                  [ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=11798Coding Guidelines2010-07-14T19:14:01Z<p>Adilger: /* Beautiful Code */</p>
<hr />
<div><small>''(Updated: Jan 2010)''</small><br />
== Beautiful Code == <br />
<br />
''A note from Eric Barton, a Lustre pioneer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''rq_reqmsg'' and ''rq_repmsg'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parenthesis, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* No tabs should be used in any Lustre™, LNET or ''libcfs'' files. The exceptions are ''libsysio'' (maintained by someone else), ''ldiskfs'' and kernel patches (also part of a non-Lustre Group project).<br />
<br />
* Blocks should be indented 8 spaces.<br />
<br />
* New files should contain the following along with the license boilerplate. This will cause vim and emacs to use spaces instead of tabs for indenting. If you use a different editor, it also needs to be set to use spaces for indenting Lustre code.<br />
<pre><br />
/* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*-<br />
* vim:expandtab:shiftwidth=8:tabstop=8:<br />
*/<br />
</pre><br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
        do_sth2_1;<br />
<br />
        if (cond3)<br />
                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
        do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
        if (!cond1)<br />
                return;<br />
<br />
        do_sth1_1;<br />
<br />
        if (cond2)<br />
                func_helper(...);<br />
<br />
        do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
        if (cond1) {<br />
                do_sth1_1;<br />
                if (cond2) {<br />
                        do_sth2_1;<br />
                        if (cond3) {<br />
                                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
                        }<br />
                        do_sth2_2;<br />
                }<br />
                do_sth1_2;<br />
        }<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regexps:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
ptr = malloc(size);<br />
if (ptr != NULL) {<br />
...<br />
<br />
wrong:<br />
if ((ptr = malloc(size)) != NULL) {<br />
...<br />
</pre><br />
<br />
* Conditional expressions read more clearly if only boolean expressions are implicit (i.e., non-boolean and pointer expressions compare explicitly with ''0'' and ''NULL'' respectively.)<br />
<pre><br />
right:<br />
if (!writing && /* not writing? */<br />
inode != NULL && /* valid inode? */<br />
ref_count == 0) /* no more references? */<br />
do_this();<br />
<br />
wrong:<br />
if (writing == 0 && /* not writing? */<br />
inode && /* valid inode? */<br />
!ref_count) /* no more references? */<br />
do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
        if (a->a_field == 3 ||<br />
            ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
                do_this();<br />
<br />
wrong:<br />
        if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
                do_this();<br />
<br />
wrong:<br />
        if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
           (c->c_field & (BITMASK2)))))<br />
                do_this();<br />
</pre><br />
<br />
=== Lustre Guidelines ===<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each'' followed by ''list_entry''<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
int *array;<br />
<br />
OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(int)); /* will break if array becomes __u64 */<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
<br />
</pre><br />
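The same rule can be tried in plain userspace C. This is a minimal sketch, assuming ''calloc()'' as a stand-in for OBD_ALLOC; ''alloc_counters()'' is a hypothetical name used only for illustration.<br />

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate 'count' zeroed counters.  The element
 * size is taken from the variable itself, so changing the type of
 * 'counters' (say, from uint64_t to uint32_t) needs no other edit. */
static uint64_t *alloc_counters(size_t count)
{
        uint64_t *counters;

        counters = calloc(count, sizeof(*counters));

        return counters;
}
```

The caller owns the returned memory and releases it with ''free()''; a NULL return means the allocation failed.<br />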
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
list_for_each_entry(...) {<br />
<br />
if (!condition1) {<br />
do_sth1;<br />
continue;<br />
}<br />
<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
<br />
if (!condition2)<br />
break;<br />
<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
}<br />
wrong:<br />
list_for_each_entry(...) {<br />
if (condition1) {<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
if (condition2) {<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
continue;<br />
} <br />
break;<br />
} else {<br />
do_sth1;<br />
}<br />
}<br />
</pre><br />
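A compilable version of the "right" shape above might look as follows. It is only a sketch: ''sum_until_zero()'' and its rules (skip negative entries, stop at the first zero) are invented for illustration and do not come from the Lustre tree.<br />

```c
#include <stddef.h>

/* Hypothetical example: add up the positive entries of vals[],
 * skipping negative entries and stopping at the first zero. */
static long sum_until_zero(const int *vals, size_t count)
{
        long   sum = 0;
        size_t i;

        for (i = 0; i < count; i++) {
                /* simple case first: nothing to do for this entry */
                if (vals[i] < 0)
                        continue;

                /* terminator: leave the loop early */
                if (vals[i] == 0)
                        break;

                /* main work stays at one level of indentation */
                sum += vals[i];
        }

        return sum;
}
```

Handling the cheap cases first keeps the main work at a single nesting level, which also makes the 80-column limit easier to meet.<br />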
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column, preferably with longer declarations at the top.<br />
<pre><br />
right:<br />
        int           len;<br />
        int           count;<br />
        struct inode *inode;<br />
<br />
wrong:<br />
        int len, count;<br />
        struct inode *inode;<br />
</pre><br />
<br />
* Variable declarations should be kept to an internal scope, if practical and reasonable, to simplify understanding of where these variables are used:<br />
<br />
<pre><br />
right:<br />
int len;<br />
<br />
if (len > 0) {<br />
int count;<br />
struct inode *inode = iget(foo);<br />
<br />
count = inode->i_size;<br />
:<br />
}<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
right:<br />
if (foo)<br />
bar();<br />
wrong:<br />
if (foo) bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
        variable = do_something_complicated(long_argument, longer_argument,<br />
                                            longest_argument(sub_argument,<br />
                                                             foo_argument),<br />
                                            last_argument);<br />
<br />
        if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
            another_long_condition(very_long_argument_name,<br />
                                   another_long_argument_name) ><br />
            second_long_value) {<br />
                ...<br />
<br />
wrong:<br />
<br />
        variable = do_something_complicated(long_argument, longer_argument,<br />
                longest_argument(sub_argument, foo_argument),<br />
                last_argument);<br />
<br />
        if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
                another_long_condition(very_long_argument_name,<br />
                        another_long_argument_name) ><br />
                second_long_value) {<br />
                ...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
        off = le32_to_cpu(fsd->fsd_client_start) +<br />
                cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
right:<br />
a++;<br />
b |= c;<br />
d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right: <br />
do_foo(bar, baz);<br />
<br />
wrong:<br />
do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
right:<br />
for (a = 0; a < b; a++)<br />
if (a < b || a == c)<br />
while (1)<br />
wrong:<br />
for( a=0; a<b; a++ )<br />
if( a<b || a==c )<br />
while( 1 )<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
if (bar) {<br />
this();<br />
that();<br />
} else if (baz) {<br />
stuff();<br />
} else {<br />
other_stuff();<br />
}<br />
<br />
do {<br />
cow();<br />
} while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else {<br />
salmon();<br />
}<br />
<br />
wrong:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else<br />
moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a)                     \<br />
do {                                    \<br />
        int b = (a) + MAGIC;            \<br />
        do_other_stuff(b);              \<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a) \<br />
do { \<br />
        int b = a + MAGIC; \<br />
        do_other_stuff(b); \<br />
} while (0);<br />
</pre><br />
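To see why both rules matter, consider this small sketch. All names in it (MAGIC as 10, SCALE_BAD, SCALE_GOOD, RECORD_MAX, ''larger_of()'') are made up for the illustration and are not Lustre macros.<br />

```c
#define MAGIC 10

/* wrong: the parameter is not parenthesized, so SCALE_BAD(1 + 2)
 * expands to (1 + 2 * MAGIC) */
#define SCALE_BAD(a)    (a * MAGIC)

/* right: SCALE_GOOD(1 + 2) expands to ((1 + 2) * MAGIC) */
#define SCALE_GOOD(a)   ((a) * MAGIC)

/* Multi-statement macro wrapped in do/while (0) with no trailing
 * semicolon, so it can be used like a single statement.  Note that
 * 'max' and 'val' are each evaluated more than once. */
#define RECORD_MAX(max, val)            \
do {                                    \
        if ((val) > (max))              \
                (max) = (val);          \
} while (0)

/* Because the macro has no trailing semicolon of its own, it can sit
 * in an unbraced if/else without breaking the else branch. */
static int larger_of(int a, int b)
{
        int max = a;

        if (b > a)
                RECORD_MAX(max, b);
        else
                RECORD_MAX(max, a);

        return max;
}
```

With a trailing semicolon in the macro definition, the ''if'' branch in ''larger_of()'' would expand to two statements and the ''else'' would no longer compile.<br />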
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
if (invalid_dentry(parent)) {<br />
...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
if (d_unhashed(parent)) {<br />
#endif<br />
...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
of<br />
stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
* stuff */<br />
# endif /* HAVE_FEATURE */<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
* as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64                  LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t                 LPSZ (or cast to int and use %u / %d)<br />
__u32/int              %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long   %llu/%llx/%lld<br />
loff_t                 %lld after a cast to long long (unfortunately)<br />
</pre><br />
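The cast-then-print approach from the table can be sketched in portable C. This example assumes the userspace types ''off_t'' and ''uint64_t'' in place of the kernel's loff_t and __u64, and ''format_extent()'' is a hypothetical helper; the Lustre LPU64-style macros are not used here.<br />

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: format an offset/length pair into buf.
 * The width of off_t and uint64_t relative to long long differs
 * between platforms, so both values are cast explicitly to the type
 * that the format specifier expects. */
static int format_extent(char *buf, size_t size, off_t offset,
                         uint64_t length)
{
        return snprintf(buf, size, "offset=%lld length=%llu",
                        (long long)offset, (unsigned long long)length);
}
```

The explicit casts keep the format string valid on both 32-bit and 64-bit builds, which is the portability problem the table above addresses.<br />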
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
                   [ac_cv_emxos2=yes],<br />
                   [ac_cv_emxos2=no])])<br />
</pre><br />
:or even<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
               [ac_cv_emxos2],<br />
               [AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
                                                   [return __EMX__;])],<br />
                                  [ac_cv_emxos2=yes],<br />
                                  [ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Architecture_-_Interoperability_fids_zfs&diff=11795Architecture - Interoperability fids zfs2010-07-12T23:48:51Z<p>Adilger: minor updates to FID_SEQ table</p>
<hr />
<div>'''''Note:''''' ''The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain both outdated information and unimplemented functionality.'' <br />
<br />
== Summary ==<br />
<br />
This document describes an architecture for client, server, network, and storage interoperability during migration from 1.6-based, fidless Lustre clusters that use ldiskfs as a back-end file system to clusters based on fids and the zfs file system.<br />
<br />
== Definitions ==<br />
<br />
As release numbers and numbering schemas are in flux, the description below uses symbolic names for various important points in Lustre development.<br />
<br />
; '''OLD''' : any major release in b1_6 line of development. This might end up being 1.6.something, or 1.7.<br />
; '''OLD.x''' : a release in b1_6 line containing client that is able to interact with a NEW.0 md server. (Tentatively 1.8.)<br />
; '''NEW.0''' : first release based on HEAD. This features kernel server, and uses ldiskfs as a back-end. This is (tentatively) 2.0. It is important to note that NEW.0 is a temporary intermediate release whose purpose is to effect transition from ldiskfs-based to DMU-based clusters.<br />
; '''NEW.1''' : next release based on HEAD. This release introduces support for fids on OST, and DMU as a back-end, in addition to continued support for ldiskfs. This is (tentatively) 2.x.<br />
; '''OLD protocol''' : b1_6 wire network protocol.<br />
; '''NEW protocol''' : wire protocol using fids for object identification.<br />
; '''OLD storage''', '''OLD file system''' : back-end file system of type ldiskfs.<br />
; '''DMU storage''' : back-end file system implemented through DMU.<br />
; '''fill-in-fid''' : a special not otherwise used fid value, reserved to indicate in a CREATE RPC that client requests server to generate fid for newly created object on client's behalf. This fid is taken from one of the system-reserved fid sequences.<br />
<br />
== Requirements ==<br />
<br />
; '''+-1 rule''' : adhere to the Lustre promise of maintaining interoperability one release back and forth.<br />
; '''downgrade''' : users are able to abandon upgrade and return back to the old cluster configuration up to a well-defined point of no-return when a decision is made to proceed forward. After that point downgrade is possible, on a condition that (potentially) all file system modifications made after no-return are lost.<br />
; '''rolling upgrade''' : an upgrade (and downgrade) is performed in a piecemeal fashion, a node after a node.<br />
; '''continuity''' : where possible, upgrade and downgrade do not disrupt ongoing operations. Client upgrade or downgrade obviously requires a client remount. Server upgrade and downgrade look like a server fail-over, with client operations continuing.<br />
; '''no stop-the-world''' : migration path cannot require the whole cluster to be stopped for a prolonged amount of time (e.g., to migrate all data to the new format).<br />
<br />
== Compatibility matrix ==<br />
<br />
{| border=1 cellspacing=0 cellpadding="5"<br />
|-<br />
! || colspan=3|OLD || colspan=3|OLD.x || colspan=3|NEW.0 ||colspan=3|NEW.1<br />
|-<br />
! || C || O || M || C || O || M || C || O || M || C || O || M <br />
|-<br />
! OLD protocol <br />
| X || X || X || X || X || X || - || X || - || - || - || -<br />
|-<br />
! NEW protocol <br />
| - || - || - || X || - || - || X || - || X || X || X || X<br />
|-<br />
! OLD storage <br />
| bgcolor=707070| || X || X || bgcolor=707070| || X || X || bgcolor=707070| || X || X || bgcolor=707070| || - || -<br />
|-<br />
! DMU storage <br />
| bgcolor=707070| || - || - ||bgcolor=707070| || - || - ||bgcolor=707070| || - || - ||bgcolor=707070| || X || X<br />
|}<br />
<br />
Legend<br />
<br />
; '''C''' : client<br />
; '''O''' : OSS<br />
; '''M''' : MDT<br />
; '''X''' : given version supports given format or protocol<br />
; '''-''' : given version does not support given format or protocol<br />
; gray area : impossible combination<br />
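<br />
The protocol rows of the matrix can be transcribed into a lookup table. The sketch below is only an illustration: the enum and function names are invented here, the values are copied from the matrix as written, and only the client-to-MDT check is shown.<br />

```c
#include <stdbool.h>

/* Releases and node roles, in the column order of the matrix above. */
enum release { REL_OLD, REL_OLD_X, REL_NEW_0, REL_NEW_1, REL_COUNT };
enum role    { ROLE_CLIENT, ROLE_OSS, ROLE_MDT, ROLE_COUNT };
enum proto   { PROTO_OLD, PROTO_NEW, PROTO_COUNT };

/* protocol_support[protocol][release][role], transcribed row by row. */
static const bool protocol_support[PROTO_COUNT][REL_COUNT][ROLE_COUNT] = {
    [PROTO_OLD] = {
        [REL_OLD]   = { true,  true,  true  },
        [REL_OLD_X] = { true,  true,  true  },
        [REL_NEW_0] = { false, true,  false },
        [REL_NEW_1] = { false, false, false },
    },
    [PROTO_NEW] = {
        [REL_OLD]   = { false, false, false },
        [REL_OLD_X] = { true,  false, false },
        [REL_NEW_0] = { true,  false, true  },
        [REL_NEW_1] = { true,  true,  true  },
    },
};

/* Does a node of the given release and role speak protocol p? */
static bool supports(enum proto p, enum release rel, enum role r)
{
    return protocol_support[p][rel][r];
}

/* True if a client of release c and an MDT of release m share a protocol. */
static bool client_can_talk_to_mdt(enum release c, enum release m)
{
    for (int p = 0; p < PROTO_COUNT; p++)
        if (supports((enum proto)p, c, ROLE_CLIENT) &&
            supports((enum proto)p, m, ROLE_MDT))
            return true;
    return false;
}
```

With this table, an OLD.x client can talk to a NEW.0 MDT (both speak the NEW protocol), while an OLD client cannot, which is why all clients are moved to OLD.x before the MDT is upgraded.<br />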
<br />
== Migration path ==<br />
<br />
The following upgrade path is envisaged:<br />
<br />
* starting with OLD version installed on the cluster...<br />
* OLD.x release is installed, making clients upward compatible with NEW.0 MDT server. This step can be undone without loss of functionality or availability.<br />
* all clients are upgraded to OLD.x.<br />
* NEW.0 md server is installed, and original (OLD.x md server) is failed over to the former. Clients can continue without evictions. This step can be undone with the minor loss of availability (e.g., evictions during downgrade).<br />
* NEW.0 release is installed on client and OSS nodes. Client has to unmount and remount file system to continue with the new release. This step can be undone with the minor loss of availability (again, unmount followed by remount to revert back to the old release).<br />
* clients and OST's are upgraded to NEW.1 release. At that moment, no OLD code is running in the cluster, but all data and meta-data are still stored in the OLD format, except for the redundant information, like object index, and fids in EA, not used by the OLD server.<br />
* MDT fails over to NEW.1. On a reconnect, OST's switch to NEW protocol. At this moment, all networking traffic is in NEW protocol.<br />
* NEW.1 dmu based ost's are formatted and added to the cluster.<br />
* online migration of data starts. This step can be undone without loss of functionality or availability.<br />
* NEW.1 DMU mdt is formatted. Magic meta-data migration tool is invoked. '''?Q not clear yet. Downgrade?''' <br />
* once meta-data are migrated to the NEW.1, upgrade is complete.<br />
<br />
{| border=1 cellspacing=0 cellpadding="5"<br />
|-<br />
! Label || Client || OSS || MDT || Upgrade comment (read top-to-bottom) || Downgrade comments (read bottom-to-top)<br />
|-<br />
| all-old || OLD || OLD || OLD || original configuration ||rowspan="3"|downgrade of clients, OSS and MDT to OLD can be performed in any order<br />
|-<br />
| client-old.x || OLD.x || OLD || OLD ||rowspan="3"|upgrade of clients, OSS and MDT to OLD.x can be performed in any order<br />
|-<br />
| oss-old.x || OLD.x || OLD.x || OLD ||<br />
|-<br />
| all-old.x || OLD.x || OLD.x || OLD.x || MDT is failed over to OLD.x version. On reconnect clients and OSS servers recognize downgrade and switch to the OLD protocol.<br />
|-<br />
|mdt-new.0 || OLD.x ||OLD.x || NEW.0 || as new server is failed over to, OLD.x clients recognize this and start using NEW protocol to talk to MDT. OST still uses OLD protocol to talk to the MDT. ||rowspan="2"| clients are downgraded to OLD.x version in any order. They continue to speak NEW protocol. If SOM was activated during upgrade, no further downgrade is possible.<br />
|-<br />
|client-new.0 || NEW.0 ||OLD.x || NEW.0 || rowspan="2"|clients and OSSes are upgraded to NEW-protocol-only version in any order.<br />
|-<br />
|all-new.0 || NEW.0 ||NEW.0 || NEW.0 || SOM is de-activated on the MDT, if it was enabled.<br />
|-<br />
|new.0-som || NEW.0 ||NEW.0 || NEW.0 || (Optional) SOM is activated on the MDT. || all data are in OLD format.<br />
|-<br />
|client-new.1 || NEW.1 || NEW.1 || NEW.0 || Clients and OST's are upgraded to NEW.1 in any order. OST's continue to talk to the MDT using the OLD protocol. || OST's migrate back to NEW.0<br />
|-<br />
|mdt.1 || NEW.1 || NEW.1 || NEW.1 || MDT fails over to the NEW.1 version, and announces to OST's that it talks the NEW protocol. OST's switch to the NEW protocol on reconnect. || MDT fails over to the NEW.0 version. OST's switch to the OLD protocol on reconnect.<br />
|-<br />
|data.dmu || NEW.1 || NEW.1 || NEW.1 || New DMU-based OST's are formatted and added to the cluster. Data migration starts. || ldiskfs-based NEW.1 OST's are added into cluster and data are migrated back to them.<br />
|-<br />
|all-data.dmu || NEW.1 ||NEW.1 || NEW.1 || all data are on DMU OSS servers.|| original configuration<br />
|-<br />
|colspan="6"|point-of-no-return.<br />
|-<br />
|all-dmu || NEW.1 ||NEW.1 || NEW.1 || meta-data is converted (offline?) to new DMU based MDT.|| downgrade is not possible from here.<br />
|}<br />
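<br />
The ordering of the table rows, and the point of no return before the last one, can be captured in a few lines of C. This is a sketch only: the enum and helper names are invented for illustration, and the SOM-related restriction mentioned in the downgrade column is deliberately left out.<br />

```c
#include <stdbool.h>
#include <stddef.h>

/* Rows of the migration table, in upgrade order. */
enum migration_step {
    STEP_ALL_OLD, STEP_CLIENT_OLD_X, STEP_OSS_OLD_X, STEP_ALL_OLD_X,
    STEP_MDT_NEW_0, STEP_CLIENT_NEW_0, STEP_ALL_NEW_0, STEP_NEW_0_SOM,
    STEP_CLIENT_NEW_1, STEP_MDT_1, STEP_DATA_DMU, STEP_ALL_DATA_DMU,
    STEP_ALL_DMU, /* reached only after the point of no return */
    STEP_COUNT
};

static const char *const step_label[STEP_COUNT] = {
    "all-old", "client-old.x", "oss-old.x", "all-old.x",
    "mdt-new.0", "client-new.0", "all-new.0", "new.0-som",
    "client-new.1", "mdt.1", "data.dmu", "all-data.dmu",
    "all-dmu",
};

/* A step can be undone unless it is the first row or lies past the
 * point of no return. */
static bool can_downgrade_from(enum migration_step s)
{
    return s > STEP_ALL_OLD && s < STEP_ALL_DMU;
}

/* Label of the configuration a downgrade returns to, or NULL. */
static const char *previous_label(enum migration_step s)
{
    return can_downgrade_from(s) ? step_label[s - 1] : NULL;
}
```

For example, previous_label(STEP_MDT_NEW_0) yields "all-old.x", while previous_label(STEP_ALL_DMU) yields NULL because the meta-data conversion cannot be undone.<br />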
<br />
== Use Cases ==<br />
<br />
{| border=1 cellspacing=0<br />
|-<br />
!id !! quality attribute !! summary<br />
|-<br />
|old.x-client || usability || OLD.x client is introduced into otherwise OLD cluster.<br />
|-<br />
|mdt.upgrade.0 || usability, availability || OLD.x MDT fails over to NEW.0 MDT<br />
|-<br />
|mdt.upgrade.0.client ||availability || "...": client reconnection and recovery<br />
|-<br />
|new.1-ost || usability || NEW.1 OST is added to a cluster containing NEW.1 clients.<br />
|-<br />
|mdt.upgrade.1 || usability, availability || NEW.0 MDT fails over to NEW.1 MDT<br />
|-<br />
|mdt.upgrade.1.ost ||availability || "...": OST reconnection and recovery<br />
|-<br />
|mdt.downgrade.0 || usability, availability || NEW.0 MDT fails over to OLD.x MDT.<br />
|-<br />
|mdt.downgrade.0.client ||availability || "...": client reconnection and recovery<br />
|-<br />
|mdt.downgrade.1 || usability, availability || NEW.1 MDT fails over to NEW.0 MDT.<br />
|-<br />
|mdt.downgrade.1.ost ||availability || "...": OST reconnection and recovery<br />
|}<br />
<br />
NEW.0 MDT handles...<br />
{| border=1 cellspacing=0<br />
|-<br />
!id !! quality attribute !! summary<br />
|-<br />
|mdt.lookup.old || correctness || LOOKUP for a file created by OLD MDT<br />
|-<br />
|mdt.lookup.new.0 || correctness || LOOKUP for a file created by NEW.0 MDT<br />
|-<br />
|mdt.create || correctness || CREATE with a fid supplied by a client<br />
|-<br />
|mdt.readdir || correctness || READDIR<br />
|}<br />
<br />
NEW.0 OST handles ...<br />
{| border=1 cellspacing=0<br />
|-<br />
!id !! quality attribute !! summary<br />
|-<br />
|ost.lookup.old || correctness || LOOKUP for a file created by OLD OST<br />
|-<br />
|ost.lookup.new.0 || correctness || LOOKUP for a file created by NEW.0 OST<br />
|-<br />
|ost.create || correctness || CREATE with a fid supplied by a client<br />
|-<br />
|ost.unlink || correctness || UNLINK<br />
|}<br />
<br />
== Quality Attribute Scenarios ==<br />
<br />
; '''old.x-client'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || OLD.x client is introduced into otherwise OLD cluster.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || permit rolling upgrade<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with OLD release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| lustre client<br />
|-align="left"<br />
|'''Response:'''|| OLD client unmounts, OLD.x release is installed on a cluster node. Client connects to the MDT, requesting OBD_CONNECT_FID, which is not granted. Client detects that it connected to the OLD MDT.<br />
|-align="left"<br />
|'''Response measure:'''|| client should be able to talk to the OLD MDT.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| <br />
|-<br />
|}<br />
<br />
; '''new.1-ost'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.1 OST is added to a cluster containing NEW.1 clients<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || permit rolling server upgrade<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability, availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| OST<br />
|-align="left"<br />
|'''Response:'''|| NEW.0 OST fails over to NEW.1 version. OST reconnects to MDT, requesting OBD_CONNECT_FID, which is not granted. OST detects that it connected to NEW.0 MDT, and clears OBD_CONNECT_FID bit in '''its''' supported connection flags mask, forcing all reconnecting clients into OLD mode.<br />
|-align="left"<br />
|'''Response measure:'''|| OST should be able to talk to the NEW.0 MDT and NEW.0 clients.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| <br />
|-<br />
|}<br />
<br />
; '''mdt.upgrade.0'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || OLD.x MDT fails over to NEW.0 MDT<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || upgrade to NEW.0 without downtime<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability, availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with OLD.x release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, the MDT creates the missing NEW.0 files (/oi, /fld, /seq, etc.) and starts recovery, accepting NEW-protocol connections from the clients and OLD-protocol connections from OSS servers. When receiving a replay of a CREATE RPC with a fill-in-fid, the MDT generates the fid internally (using the seq service) and returns it to the client.<br />
|-align="left"<br />
|'''Response measure:'''|| Fail-over and recovery have to complete successfully<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| recovery, see following scenarios<br />
|-<br />
|}<br />
<br />
; '''mdt.upgrade.0.client'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || OLD.x MDT fails over to NEW.0 MDT, client reconnects and replays.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || successful recovery <br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of OLD.x and NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| client<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, the client gets the OBD_CONNECT_FID bit from the MDT and detects that it now talks to a NEW.0 MDT. It continues to use the OLD protocol to talk to OST's. The client proceeds with recovery, converting requests into the new format and converting inode numbers in RPCs into fids. For CREATE RPCs, the otherwise impossible fill-in-fid (from a system-reserved fid sequence) is used to indicate that the server has to generate the fid. The client must be prepared for the server to overwrite the client-supplied fid in any CREATE RPC. There should be no need to rebuild any internal data structures (locks, inode table, pages, etc.), as all objects are identified by fids internally in OLD.x mode.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.upgrade.1.ost'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT fails over to NEW.1 MDT, OST reconnects and replays.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || successful recovery <br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| upgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| OST<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over OST gets OBD_CONNECT_FID bit from MDT and detects that it now talks to NEW.1 MDT. OST sets OBD_CONNECT_FID in its own supported connect bits mask. OST proceeds with MDT-OST recovery, converting requests into new format, and converting inode numbers in RPCs into fids.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.downgrade.0'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT fails over to OLD.x MDT<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of OLD.x and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, MDT starts OLD-protocol recovery, accepting connections in OLD protocol.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.downgrade.0.client'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT fails over to OLD.x MDT: client reconnection and recovery<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of OLD.x and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| client<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, client reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, client switches to OLD.x mode, and starts replay, converting RPCs to the OLD protocol. If client is unable to convert an RPC, because it doesn't know inode number corresponding to the fid, it evicts itself.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| Search for "KABOOM" on this page.<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
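<br />
The conversion step in this scenario can be sketched as follows. The code assumes the IGIF layout described under Technical Details (sequence = inode number in [12, 0xffffffff], OID = inode generation); the struct and function names are illustrative, not taken from the Lustre sources.<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Same field widths as the lu_fid row of the format table. */
struct lu_fid_sketch {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

#define SKETCH_FID_SEQ_IGIF     12ULL
#define SKETCH_FID_SEQ_IGIF_MAX 0xffffffffULL

enum replay_action { REPLAY_CONVERTED, REPLAY_EVICT_SELF };

/* An IGIF is recognised purely by its sequence range. */
static bool fid_is_igif(const struct lu_fid_sketch *fid)
{
    return fid->f_seq >= SKETCH_FID_SEQ_IGIF &&
           fid->f_seq <= SKETCH_FID_SEQ_IGIF_MAX;
}

/* Convert a fid back to the OLD-protocol (inode, generation) cookie.
 * A fid that is not an IGIF has no OLD equivalent known to the client,
 * so the only option left is self-eviction. */
static enum replay_action fid_to_old_cookie(const struct lu_fid_sketch *fid,
                                            uint32_t *ino, uint32_t *gen)
{
    if (!fid_is_igif(fid))
        return REPLAY_EVICT_SELF;
    *ino = (uint32_t)fid->f_seq; /* sequence number is the inode number */
    *gen = fid->f_oid;           /* OID is the inode generation */
    return REPLAY_CONVERTED;
}
```

A fid such as {1234, 7, 0} converts to inode 1234, generation 7, whereas a client-generated fid from the normal sequence range leads to REPLAY_EVICT_SELF.<br />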
<br />
; '''mdt.downgrade.1'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.1 MDT fails over to NEW.0 MDT<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, MDT starts recovery, accepting connections in OLD protocol from OST's and in NEW protocol from clients.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| <br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.downgrade.1.ost'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.1 MDT fails over to NEW.0 MDT: ost reconnection and recovery<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || downgrade with a minimal loss of availability<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| availability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| cluster administrator<br />
|-align="left"<br />
| '''Stimulus:'''|| downgrade schedule<br />
|-align="left"<br />
|'''Environment:'''|| cluster with a mixture of NEW.1 and NEW.0 releases of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| OST<br />
|-align="left"<br />
|'''Response:'''|| After a fail-over, OST reconnects, and is denied OBD_CONNECT_FID bit. Recognizing that MDT was downgraded, OST switches to NEW.0 mode, clears OBD_CONNECT_FID bit in its supported connect flags mask, and starts replay, converting RPCs to the OLD protocol.<br />
|-align="left"<br />
|'''Response measure:'''|| successful recovery<br />
|-align="left"<br />
|colspan=2|'''Questions:'''|| Search for "KABOOM" on this page.<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.lookup.old'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by OLD.x server.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || access to existing data and meta-data<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| Given a fid of the parent directory, the server translates it into an inode number (either by doing igif->ino computation, or using the /oi index), loads the directory inode and looks the given name up. If the name is found (-ENOENT otherwise), the MDT loads the inode and checks for the "FID" EA. Assuming the EA does not exist (see next QAS otherwise), the server learns that the inode was created by an OLD.x server, generates an igif fid from the (inode number, inode generation) pair, and sends this fid to the client as the lookup result.<br />
|-align="left"<br />
|'''Response measure:'''|| consistent lookup result that can later be used to access file<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
; '''mdt.lookup.new.0'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles LOOKUP(pdir, name) RPC, where name refers to the file created by NEW.0 server.<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || access to newly created data and meta-data<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| Given a fid of parent directory, server translates it into inode number (either by doing igif->ino computation, or using /oi index), loads directory inode and looks given name up. If name is found (-ENOENT otherwise), MDT loads inode and checks for "FID" EA. Assuming EA exists (see previous QAS otherwise), server learns that inode was created by NEW.0 server, interprets EA contents as a fid, and sends this fid to client as lookup result.<br />
|-align="left"<br />
|'''Response measure:'''|| consistent lookup result that can later be used to access file<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''|| Possible sanity check: once fid was determined, check that /oi maps this fid to the inode number that was found in the directory.<br />
|-<br />
|}<br />
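<br />
Both LOOKUP scenarios reduce to the same decision once the inode has been loaded: use the "FID" EA if present, otherwise synthesize an igif. A minimal sketch follows, with a mock inode structure standing in for the real back-end inode; all names are illustrative.<br />

```c
#include <stdbool.h>
#include <stdint.h>

struct lu_fid_sketch {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* Mock of the on-disk inode state relevant to lookup. */
struct inode_sketch {
    uint32_t ino;                /* inode number */
    uint32_t gen;                /* inode generation */
    bool has_fid_ea;             /* true if a "FID" EA is stored */
    struct lu_fid_sketch fid_ea; /* EA contents, valid if has_fid_ea */
};

/* Surrogate fid for an inode created by an OLD.x server. */
static struct lu_fid_sketch igif_build(uint32_t ino, uint32_t gen)
{
    struct lu_fid_sketch fid = { .f_seq = ino, .f_oid = gen, .f_ver = 0 };
    return fid;
}

/* Fid returned to the client as the lookup result. */
static struct lu_fid_sketch inode_to_fid(const struct inode_sketch *inode)
{
    if (inode->has_fid_ea)
        return inode->fid_ea;                   /* created by NEW.0 */
    return igif_build(inode->ino, inode->gen);  /* created by OLD.x */
}
```

The same helper would serve the readdir path, where every directory entry needs a fid regardless of which server created the inode.<br />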
<br />
; '''mdt.create'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles CREATE(fid) RPC, with fid supplied by a client<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || create object that can later be accessed through client supplied fid.<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| If the fid equals the special fill-in-fid constant, the MDT generates a new fid from an internal fid sequence. A new inode is created. A "FID" EA is allocated for this inode and filled with the fid. A new (inode-number, inode-generation) record is inserted into the /oi index with the fid as a key.<br />
|-align="left"<br />
|'''Response measure:'''|| new object created, and can be accessed by fid later.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
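<br />
The CREATE handling above can be sketched with an in-memory stand-in for the /oi index. Everything here is an assumption made for illustration: the fill-in-fid value is hypothetical (the text only says it comes from a system-reserved sequence), and the inode allocator and index are mocks.<br />

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct lu_fid_sketch {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* One /oi record: fid key -> (inode number, inode generation). */
struct oi_entry {
    struct lu_fid_sketch fid;
    uint32_t ino;
    uint32_t gen;
};

#define OI_CAPACITY 16

struct mdt_sketch {
    uint64_t seq;       /* sequence the MDT allocates fids from */
    uint32_t next_oid;  /* next unused object id in that sequence */
    uint32_t next_ino;  /* stand-in for the back-end inode allocator */
    struct oi_entry oi[OI_CAPACITY];
    size_t oi_used;
};

/* Hypothetical fill-in-fid; the real reserved value is not given here. */
static const struct lu_fid_sketch FILL_IN_FID = { 0x2000003ffULL, 0xffffffffu, 0 };

static bool fid_eq(const struct lu_fid_sketch *a, const struct lu_fid_sketch *b)
{
    return a->f_seq == b->f_seq && a->f_oid == b->f_oid && a->f_ver == b->f_ver;
}

static const struct oi_entry *oi_lookup(const struct mdt_sketch *mdt,
                                        const struct lu_fid_sketch *fid)
{
    for (size_t i = 0; i < mdt->oi_used; i++)
        if (fid_eq(&mdt->oi[i].fid, fid))
            return &mdt->oi[i];
    return NULL;
}

/* Returns 0 and stores the final fid in *out, or -1 if the mock index is full. */
static int mdt_create(struct mdt_sketch *mdt, struct lu_fid_sketch requested,
                      struct lu_fid_sketch *out)
{
    if (mdt->oi_used == OI_CAPACITY)
        return -1;
    if (fid_eq(&requested, &FILL_IN_FID)) {
        /* Client asked the server to pick the fid on its behalf. */
        requested.f_seq = mdt->seq;
        requested.f_oid = mdt->next_oid++;
        requested.f_ver = 0;
    }
    struct oi_entry *e = &mdt->oi[mdt->oi_used++];
    e->fid = requested;       /* stands for both the "FID" EA and the /oi key */
    e->ino = mdt->next_ino++;
    e->gen = 1;               /* mock generation */
    *out = requested;
    return 0;
}
```

The fid stored in *out is what the server returns to the client, which is why a client replaying a CREATE with the fill-in-fid must accept a different fid in the reply.<br />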
<br />
; '''mdt.readdir'''<br />
<br />
{|border=1 cellspacing="0"<br />
|-align="left" <br />
|colspan=2|'''Scenario:''' || NEW.0 MDT handles READPAGE(parent-fid, offset) RPC<br />
|-align="left" <br />
|colspan=2|'''Business Goals:''' || return a page filled with NEW protocol directory entries, provide access to both new and old objects through readdir.<br />
|-align="left" <br />
|colspan=2|'''Relevant QA's:'''|| usability<br />
|-align="left"<br />
|rowspan="6" writing-mode="vertical"|'''details'''<br />
|'''Stimulus source:'''|| client application<br />
|-align="left"<br />
| '''Stimulus:'''|| RPC<br />
|-align="left"<br />
|'''Environment:'''|| cluster with NEW.0 release of lustre installed<br />
|-align="left"<br />
|'''Artifact:'''|| MDT<br />
|-align="left"<br />
|'''Response:'''|| Using the dt-index iterator interface (internally based on ldiskfs_readdir()), the MDT iterates over directory entries and places file names and their hashes into directory entries. For every entry, the corresponding inode is loaded into memory. If the inode contains a "FID" EA, its contents are used as the fid and placed into the readdir page. Otherwise, an igif fid is generated and placed into the readdir page.<br />
|-align="left"<br />
|'''Response measure:'''|| pre-existing objects, created by an OLD.x server, are visible through readdir.<br />
|-align="left"<br />
|colspan=2|'''Questions:'''||<br />
|-align="left"<br />
|colspan=2|'''Issues:'''||<br />
|-<br />
|}<br />
<br />
== Technical Details [not part of architecture, should go into HLD/DLD]==<br />
<br />
Brief outline of features relevant to interoperability and not mentioned above, supported and expected from the releases above:<br />
<br />
=== OLD.x ===<br />
<br />
* OLD.x: client and OST support both OLD and NEW networking protocol. Protocol version is selected at the time of connection to MDT: if MDT supports OBD_CONNECT_FID connect flag, NEW protocol is used, otherwise OLD.<br />
* once an OLD.x node (client or OST) has connected to the MDT in NEW mode, it ensures that all other connections are in this mode too. OST adds OBD_CONNECT_FID flag to its connection mask.<br />
* when connected in NEW mode, OLD.x client<br />
** uses fids to identify inodes in the cache (for uniformity, it can internally use igifs, generated from ino/gen pairs in the OLD mode too). Inode numbers for stat(2), are generated from fids ['''done for HEAD, being ported to b1_6_cli_reqs'''];<br />
** expects cmd3-style directory pages in readdir with fids in directory entries ['''done'''];<br />
** takes dlm locks in fid name-space ['''done'''];<br />
** participates in cmd3 recovery protocol, more on this below ['''being implemented by Amit'''];<br />
** uses seq and fld services ['''done'''];<br />
* when on a re-connect OLD.x client detects that connection lost OBD_CONNECT_FID flag that it used to have, it evicts itself to get rid of all extra fid-related state.<br />
** No interoperability changes to the MD server code are made in OLD.x release. <br />
* OLD.x OST servers also support both OLD and NEW networking protocol, and depending on the MDS connection flags either use fids or not. In fid-enabled mode, they act much like clients (see above) in their interaction with MDT. To support NEW protocol OST has to generate fids for objects already existing on the storage. Resulting surrogate fids are called idifs (igifs for data, see igif description below). ['''not started yet''']<br />
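<br />
The connect-time protocol selection described in the list above can be sketched as a small state machine. The flag value below is a placeholder (the text does not give the real OBD_CONNECT_FID bit), and the type and function names are invented for illustration.<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit; the actual OBD_CONNECT_FID value is not stated here. */
#define SKETCH_OBD_CONNECT_FID (1ULL << 0)

enum proto_mode { MODE_UNCONNECTED, MODE_OLD, MODE_NEW };
enum connect_result { CONNECT_OK, CONNECT_EVICT_SELF };

struct import_sketch {
    enum proto_mode mode; /* protocol currently used towards the MDT */
};

/* Called with the connect flags granted by the MDT on (re)connect. */
static enum connect_result on_connect_reply(struct import_sketch *imp,
                                            uint64_t granted_flags)
{
    bool fid_granted = (granted_flags & SKETCH_OBD_CONNECT_FID) != 0;

    if (imp->mode == MODE_NEW && !fid_granted) {
        /* The flag was lost: the MDT was downgraded. Drop fid state. */
        imp->mode = MODE_OLD;
        return CONNECT_EVICT_SELF;
    }
    imp->mode = fid_granted ? MODE_NEW : MODE_OLD;
    return CONNECT_OK;
}
```

Gaining the flag on a reconnect (MDT upgrade) switches the node to NEW mode without eviction; only losing a previously granted flag triggers self-eviction in this sketch.<br />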
<br />
=== NEW.0 ===<br />
<br />
This release introduces MDT server speaking NEW protocol only, and running over OLD-format storage. OST server speaking NEW protocol was introduced in the previous OLD.x release. Support for old protocol is completely eliminated in this release.<br />
<br />
To talk in the new protocol, the server has to use FIDs to identify objects, so the NEW.0 MDT generates ''surrogate'' FIDs for existing inodes. Such a surrogate FID is referred to as an ''IGIF'' (inode-generation FID), because it is built from the inode number and inode generation. Similarly, the NEW.0 OST generates surrogate FIDs (''IDIF''s) for existing id/group objects. The format of IGIF and IDIF is described in the table below:<br />
<br />
{| border=1 cellspacing=0 cellpadding="5"<br />
|fields ||SEQ ||OID ||VER<br />
|-<br />
|FID_SEQ_OST_MDT0 ||= 0 || ||<br />
|-<br />
|FID_SEQ_LLOG ||= 1 || ||<br />
|-<br />
|FID_SEQ_ECHO ||= 2 || ||<br />
|-<br />
|FID_SEQ_OST_MDT1 ||= 3 || ||<br />
|-<br />
|FID_SEQ_OST_MAX ||= 9 (=FID_SEQ_OST_MDT7) || ||<br />
|-<br />
|FID_SEQ_IGIF ||= 12 || ||<br />
|-<br />
|FID_SEQ_IGIF_MAX ||= 0xffffffff || ||<br />
|-<br />
|FID_SEQ_IDIF ||=0x100000000 || ||<br />
|-<br />
|FID_SEQ_IDIF_MAX ||=0x1ffffffff || ||<br />
|-<br />
|FID_SEQ_LOCAL_FILE||=0x200000001 || ||<br />
|-<br />
|FID_SEQ_DOT_LUSTRE||=0x200000002 || ||<br />
|-<br />
|FID_SEQ_NORMAL ||=0x200000400 || ||<br />
|-<br />
|-<br />
|obdo/lmm/oinfo(OLD)||o_seq:64 [FID_SEQ_OST_MDT0] ||o_id_lo:48||o_id_hi:16<br />
|-<br />
|obdo/lmm/oinfo(NEW.1)||o_seq:64 [FID_SEQ_{IDIF,NORMAL}]||o_id_lo:32||o_id_hi:32<br />
|-<br />
|lu_fid ||f_seq:64 ||f_oid:32 ||f_ver:32<br />
|-<br />
|IGIF ||0:32, ino:32 [12,FID_SEQ_IGIF_MAX] ||gen:32 ||0:32<br />
|-<br />
|IDIF ||0:31, 1:1, ost_idx:16,o_id_hi:16 ||o_id_lo:32||o_id_hi_hi:16<br />
|-<br />
|reserved ||[FID_SEQ_START,FID_SEQ_START+0x3ff]||f_oid:32 ||f_ver:32<br />
|-<br />
|FID ||[FID_SEQ_NORMAL,2<sup>64</sup>-1] ||f_oid:32 ||f_ver:32<br />
|}<br />
<br />
Legend:<br />
; '''FID''' : File IDentifier generated by client from range allocated by the seq service. The first 0x400 sequences [2<sup>33</sup>, 2<sup>33</sup> + 0x3ff] are reserved for system use. Note that on ldiskfs MDTs IGIF FIDs can use inode numbers starting at 12, but this is in the IGIF SEQ range and does not conflict with assigned FIDs.<br />
<br />
; '''IGIF''' : Inode and Generation In FID, a surrogate FID used to globally identify an existing object on OLD formatted MDT file system. This would only be used on MDT0 in a CMD filesystem, because there are not expected to be any OLD formatted CMD filesystems. Belongs to a sequence in [12, 2<sup>32</sup> - 1] range, where sequence number is inode number, and inode generation is used as OID. '''NOTE''': This assumes no more than 2<sup>32</sup>-1 inodes exist in the MDT filesystem, which is the maximum possible for an ldiskfs backend. '''NOTE''': This assumes that the reserved ext3/ext4/ldiskfs inode numbers [0-11] are never visible to clients, which has always been true.<br />
<br />
; '''IDIF''' : object ID in FID, a surrogate FID used to globally identify an existing object on OLD formatted OST file system. Belongs to a sequence in [2<sup>32</sup>, 2<sup>33</sup> - 1]. Sequence number is calculated as:<br />
<pre><br />
(1ULL << 32) | (ost_index << 16) | ((objid >> 32) & 0xffff)<br />
</pre><br />
; ''' ''' : that is, SEQ consists of 16-bit OST index, and higher 16 bits of object ID. The generation of unique SEQ values per OST allows the IDIF FIDs to be identified in the FLD correctly. The OID field is calculated as:<br />
<pre><br />
objid & 0xffffffff<br />
</pre><br />
; ''' ''' : that is, it consists of lower 32 bits of object ID. '''NOTE''' This assumes that no more than 2<sup>48</sup>-1 objects have ever been created on an OST, and that no more than 65535 OSTs are in use. Both are very reasonable assumptions (can uniquely map all objects on an OST that created 1M objects per second for 9 years, or combinations thereof).<br />
<br />
; '''OST_MDT0''' : Surrogate FID used to identify an existing object on OLD formatted OST filesystem. Belongs to the reserved sequence 0, and is used internally prior to the introduction of FID-on-OST, at which point IDIF will be used to identify objects as residing on a specific OST.<br />
<br />
; '''ECHO''' : for testing OST IO performance the object sequence 2 (FID_SEQ_ECHO in the table above) is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.<br />
<br />
; '''LLOG''' : for Lustre Log objects the object sequence 1 (FID_SEQ_LLOG in the table above) is used. This is compatible with both OLD and NEW.1 namespaces, as this SEQ number is in the ext3/ldiskfs reserved inode range and does not conflict with IGIF sequence numbers.<br />
<br />
; '''OST_MDT1''' .. '''OST_MAX''' : for testing with multiple MDTs the object sequence 3 through 9 is used, allowing direct mapping of MDTs 1 through 7 respectively, for a total of 8 MDTs including '''OST_MDT0'''. This matches the legacy CMD project "group" mappings. However, this SEQ range is only for testing prior to any production CMD release, as the objects in this range conflict across all OSTs, as the OST index is not part of the FID.<br />
<br />
<br />
For compatibility with existing OLD OST network protocol structures, the FID must map onto the o_id and o_gr in a manner that ensures existing objects are identified consistently for IO, as well as onto the lock namespace to ensure both IDIFs map onto the same objects for IO as well as resources in the DLM.<br />
<br />
DLM OLD OBIF/IDIF:<br />
resource[] = {o_id, o_seq, 0, 0}; /* o_seq == 0 for production releases */<br />
<br />
DLM NEW.1 FID (this is the same for both the MDT and OST):<br />
resource[] = {SEQ, OID, VER, HASH};<br />
<br />
Note that for mapping IDIF values to DLM resource names the o_id may be larger than the 2<sup>33</sup> reserved sequence numbers for IDIF, so it is possible for the o_id numbers to overlap FID SEQ numbers in the resource. However, in all production releases the OLD o_seq field is always zero, and all valid FID OID values are non-zero, so the lock resources will not collide.<br />
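<br />
The non-collision argument above can be sketched in C. This is only an illustration of the two resource layouts quoted in this section; the function names are hypothetical and are not taken from the Lustre source:<br />

```c
#include <stdint.h>

#define RES_NAME_SIZE 4

/* OLD object: resource[] = {o_id, o_seq, 0, 0}; o_seq is 0 in production. */
void res_from_old(uint64_t res[RES_NAME_SIZE], uint64_t o_id)
{
        res[0] = o_id;
        res[1] = 0;             /* o_seq, formerly the group number */
        res[2] = 0;
        res[3] = 0;
}

/* NEW.1 FID: resource[] = {SEQ, OID, VER, HASH}; a valid OID is non-zero. */
void res_from_fid(uint64_t res[RES_NAME_SIZE], uint64_t seq, uint32_t oid,
                  uint32_t ver, uint64_t hash)
{
        res[0] = seq;
        res[1] = oid;
        res[2] = ver;
        res[3] = hash;
}

/* Hypothetical check: the second slot alone tells the two layouts apart. */
int res_is_old(const uint64_t res[RES_NAME_SIZE])
{
        return res[1] == 0;
}
```

Even when an OLD o_id happens to equal a FID SEQ value, the two resource names still differ in the second slot, which is the property the paragraph above relies on.<br />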
<br />
For objects within the IDIF range, group extraction (non-CMD) will be:<br />
o_id = ((__u64)(fid->f_seq & 0xffff) << 32) | fid->f_oid;<br />
o_seq = 0; /* formerly group number */<br />
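<br />
As a minimal sketch, the IDIF SEQ and OID calculations given earlier in this section can be written as C helpers, together with the inverse that follows from them (the low 16 bits of SEQ hold object ID bits 32 to 47). The helper names are hypothetical; only the bit layout comes from the text:<br />

```c
#include <stdint.h>

/* Hypothetical names; only the bit layout is taken from the text. */
#define IDIF_SEQ_BASE (1ULL << 32)      /* first sequence in the IDIF range */

/* SEQ: IDIF base, 16-bit OST index, then bits 32..47 of the object ID. */
uint64_t idif_seq(uint64_t objid, uint16_t ost_index)
{
        return IDIF_SEQ_BASE | ((uint64_t)ost_index << 16) |
               ((objid >> 32) & 0xffff);
}

/* OID: the lower 32 bits of the object ID. */
uint32_t idif_oid(uint64_t objid)
{
        return (uint32_t)(objid & 0xffffffffULL);
}

/* Inverse mapping: rebuild the 48-bit object ID from SEQ and OID. */
uint64_t idif_objid(uint64_t seq, uint32_t oid)
{
        return ((seq & 0xffff) << 32) | oid;
}

/* The OST index is stored in bits 16..31 of SEQ. */
uint16_t idif_ost_index(uint64_t seq)
{
        return (uint16_t)((seq >> 16) & 0xffff);
}
```

For any object ID below 2<sup>48</sup>, ''idif_objid(idif_seq(id, idx), idif_oid(id))'' returns ''id'' again, and the resulting SEQ always stays inside [2<sup>32</sup>, 2<sup>33</sup> - 1].<br />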
<br />
=== Recovery ===<br />
<br />
There are 2 important recovery scenarios related to interoperability:<br />
<br />
* OLD.x client reconnects to MDT after a fail-over and learns that it has to switch back to the OLD protocol, because the server was downgraded. The client has to replay requests, but before that they have to be converted into OLD protocol format. This requires changing the message format and going from client-assigned FIDs to inode/generation numbers (storage cookies). If a FID is in IGIF format it can be converted to an inode number according to the reverse of the IGIF generation algorithm. If a FID is client-generated, then '''*KABOOM*'''! The client has to evict itself, because it doesn't know the old-format inode number. '''Q? Is there a better solution?''' What to do with RPCs that the old server cannot handle at all, such as SEQ_QUERY? Again, eviction seems to be the only option.<br />
<br />
* OLD.x client reconnects to MDT and determines that it has to switch to the new protocol, because MDT was upgraded to NEW.0. To replay RPCs, client has to convert them to the NEW format. This includes message format conversion and going from inode/generation numbers to FIDs. For RPCs that already include inode number as an argument, IGIF FID can be used. For CREATE RPC that requires fid in NEW protocol there are two options:<br />
** client supplies a fill-in FID. The NEW.0 server recognizes this as a request to generate the FID on the server, and allocates a FID from a special sequence range reserved for this purpose. Note that this sequence cannot be exhausted, as there is a single MDT in the cluster at that point, which means it has full control over the complete FID space.<br />
** client supplies inode number as in usual OLD protocol replay. Server detects this and creates inode with given inode number. This has certain drawbacks:<br />
*** a dependency on ext3-wantedi patch is re-introduced, and<br />
*** backward-compatibility code is introduced in NEW.0 release, which we are trying to avoid.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Submitting_Patches&diff=11779Submitting Patches2010-06-05T02:21:38Z<p>Adilger: /* Submitting Patches for Review */ fix up links for gates, remove 16-gate</p>
<hr />
<div><small>''(Updated: Dec 2009)''</small><br />
<br />
'''''NOTICE:''''' A transition from CVS to Git took place on Monday, December 14. For more information about the transition, see the [[Git Transition Notice]]. For details about how to migrate to Git, see [[Migrating to Git]].<br />
<br />
----<br />
<br />
When you are ready to have your patch reviewed, follow the process described below for submitting it using Bugzilla. <br />
<br />
'''''Note:''''' It is sometimes desirable to solicit reviews of a patch on the [mailto:lustre-devel@lists.lustre.org lustre-devel] mailing list to expose the patch to a wider audience. However, this will ''NOT'' put the patch on track to being accepted into the Lustre™ repository.<br />
<br />
=== Submitting Patches for Review ===<br />
<br />
To have your changes accepted into a mainline Lustre branch, your code must be reviewed and approved by senior Lustre engineers. Following these steps will speed up review of your changes and increase the likelihood of success:<br />
<br />
1. Read, complete, and return the form found at [[Media:Sun_Contributor_Agreement_1_5.pdf|Contributor Agreement]]. We cannot accept your contributions without this form. See [[Contribution_Policy|Contribution Policy]] for more information.<br />
<br />
2. Testing the patch is required before it can be submitted. The patch must include any new tests specific to the bug/feature. See [[Testing Lustre Code]] for specific details. If you have completed testing of the patch, set the "acc-sm_passed_''release''" flag to "+" for the branch(es) that passed testing in step 4 below.<br />
<br />
3. Generate a patch with ''diff -upN'', ''git diff'', or ''git format-patch''. Please do not send other kinds of patches unless your reviewer requests them.<br />
<br />
The command for generating a patch is:<pre><br />
[lustre]$ git diff {basebranch} > {patchname}.diff<br />
</pre><br />
where ''{basebranch}'' is the branch you are patching against (''b1_6'', ''b1_8'', or ''master''). Note this patch will include committed and uncommitted changes on your branch. If you have nicely squashed your commit history, feel free to use ''git format-patch''. If you are unfamiliar with this process, use ''git diff''.<br />
<br />
If sending changes with ''git format-patch'' we ask that you follow the standard commit message format when making your commits, so that the patch can more easily be identified in the future. If you are doing a rebase, you will get a chance to modify/combine your commit messages. Commit messages for final patches should look like this:<br />
<pre><br />
b=<bugno> <Single line summary of change><br />
<br />
<In depth description><br />
<br />
i=<inspector1><br />
i=<inspector2><br />
</pre><br />
<br />
If you are not using git format-patch, then simply adding the above lines at the start of the submission email is enough.<br />
<br />
4. Find or file a bug corresponding to your contribution in [http://bugzilla.lustre.org/ Bugzilla]. For more information about Bugzilla, see the [[Developers Guide to Bugzilla for Lustre|Developers Guide to Bugzilla]], the [https://bugzilla.lustre.org/page.cgi?id=bug-writing.html Bugzilla - Bug Writing Guidelines], or the [https://bugzilla.lustre.org/docs/html/using.html Bugzilla User Guide].<br />
<br />
* Provide the patch as an Attachment (click on "Add an Attachment")<br />
* Select the "patch" box.<br />
** If submitting a new bug with a patch attached, follow normal bug submission procedures. The support team will assign the bug and inspections as appropriate.<br />
** If working with a Lustre internal engineer, under "Flags" set the ''inspection'' flag to "?" and copy the email address of the engineer into the adjacent ''Requestee:'' field.<br />
** If you have not been collaborating with someone on the Lustre team and don't know who should review your work, assign the inspection to ''lustre-rmg-team@sun.com''.<br />
* Click on "commit" to submit the attachment and inspection request.<br />
<br />
5. One or more reviewers will submit comments regarding your patch. Iterate the patch until you receive inspection approval or the bug is closed.<br />
<br />
6. Once you have approval, mail the patch to [mailto:lustre-gate-20@sun.com lustre-gate-20] for Lustre 2.0, or [mailto:lustre-gate-18@sun.com lustre-gate-18] for Lustre 1.8. Include the bug number and reviewer in the commit message, along with a concise description of the change, as stated above.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Migrating_to_Git&diff=11750Migrating to Git2010-05-20T22:53:16Z<p>Adilger: /* Apply your patches to a git repo */</p>
<hr />
<div><small>''(Updated: Dec 2009)''</small><br />
<br />
To migrate ongoing work from CVS to git SCM, first convert your work to a patch, then apply that patch to a git tree.<br />
== Convert your work to patches ==<br />
=== Work that does ''NOT'' live in a private CVS branch ===<br />
If you maintain your development code with quilt or something else other than a CVS private branch, generate patches for any current work.<br />
cvs diff > my.patch<br />
<br />
=== Work from a private CVS branch ===<br />
1. In your branch CVS working tree, use 'cvs diff' against the base tree divergence point. Since you've been using the build/merge scripts, this is quite easy. For example, params tree branch:<br />
cvs diff -r HD_PARAMS_TREE_BASE > hd_params_tree.patch<br />
The merge scripts have kept the <branchname>_BASE tag updated to reflect the latest merge (don't use an old dated tag, use the one ending in _BASE).<br />
<br />
This cvs diff will include uncommitted changes in your working tree as well as all your committed code, so make sure your working directory tree is in the state you want. <br />
<br />
2. Inspect your patch to make sure it is correct. It will be a patch against the divergence point, so realize that the base branch may have moved on and your patch may have to be updated when you apply it.<br />
<br />
== Apply your patches to a git repo ==<br />
Obtain a clone of the [[Accessing Lustre Code | Lustre repository]].<br />
git clone --origin prime git@git.lustre.org:prime/lustre <mydir><br />
cd <mydir><br />
Create your own private branch. For example, a branch for bug 20000 based off of HEAD:<br />
git checkout -b bug20000 master<br />
Apply the patch to that branch<br />
patch -p1 < hd_params_tree.patch<br />
Resolve any merge conflicts, and commit the patch (to your branch)<br />
git commit -a -v<br />
<br />
== Continue development ==<br />
Sun employees should continue development under git as per the [https://wikis.lustre.org/intra/index.php/Lustre_GIT Lustre GIT page].<br />
<br />
External contributors should follow the procedure for [[Submitting_Patches|submitting patches]].</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Lustre_Tuning&diff=10900Lustre Tuning2010-02-09T18:26:40Z<p>Adilger: /* Number of Inodes for OST */</p>
<hr />
<div>__TOC__<br />
Many options in Lustre™ are set by means of kernel module parameters. These parameters are contained in the ''modprobe.conf'' file (On SuSE, this may be ''modprobe.conf.local'').<br />
<br />
==OSS Service Thread Count==<br />
The ''oss_num_threads'' parameter allows the number of OST service threads to be<br />
specified at module load time on the OSS nodes:<br />
<br />
options ost oss_num_threads={N}<br />
<br />
An OSS can have a maximum of 512 service threads and a minimum of 2 service<br />
threads. The number of service threads is a function of how much RAM and how<br />
many CPUs are on each OSS node (1 thread / 128MB * num_cpus). If the load on the<br />
OSS node is high, new service threads will be started in order to process more<br />
requests concurrently, up to 4x the initial number of threads (subject to the<br />
maximum of 512). For a 2GB 2-CPU system, the default thread count is 32 and the<br />
maximum thread count is 128.<br />
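<br />
The sizing rule described above can be sketched as follows. This mirrors only the figures quoted in this section (1 thread per 128MB per CPU, a minimum of 2, a maximum of 512, growth up to 4x); the function names are hypothetical and the real OSS code may compute the default differently:<br />

```c
#define OSS_THREADS_MIN 2U      /* minimum quoted in the text */
#define OSS_THREADS_MAX 512U    /* maximum quoted in the text */

/* Initial thread count: 1 thread / 128MB * num_cpus, clamped to the limits. */
unsigned int oss_default_threads(unsigned long ram_mb, unsigned int num_cpus)
{
        unsigned long nthreads = ram_mb / 128 * num_cpus;

        if (nthreads < OSS_THREADS_MIN)
                nthreads = OSS_THREADS_MIN;
        if (nthreads > OSS_THREADS_MAX)
                nthreads = OSS_THREADS_MAX;

        return (unsigned int)nthreads;
}

/* Under load the pool may grow to 4x the initial count, capped at 512. */
unsigned int oss_max_threads(unsigned int initial)
{
        unsigned long nthreads = 4UL * initial;

        if (nthreads > OSS_THREADS_MAX)
                nthreads = OSS_THREADS_MAX;

        return (unsigned int)nthreads;
}
```

For the 2GB 2-CPU example, ''oss_default_threads(2048, 2)'' gives 32 and ''oss_max_threads(32)'' gives 128, matching the numbers above.<br />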
<br />
Increasing the size of the thread pool may help when:<br />
* Several OSTs are exported from a single OSS<br />
* Back-end storage is running synchronously<br />
* I/O completions take excessive time due to slow storage<br />
<br />
Decreasing the size of the thread pool may help if:<br />
* The clients are overwhelming the storage capacity<br />
* There are lots of "slow I/O" or similar messages<br />
<br />
Increasing the number of I/O threads allows the kernel and storage to<br />
aggregate many writes together for more efficient disk I/O. The OSS thread pool is<br />
shared—each thread allocates approximately 1.5 MB (maximum RPC size + 0.5 MB)<br />
for internal I/O buffers.<br />
<br />
It is very important to consider memory consumption when increasing the thread<br />
pool size. Drives are only able to sustain a certain amount of parallel I/O activity<br />
before performance is degraded due to the high number of seeks and the OST<br />
threads just waiting for I/O. In this situation, it may be advisable to decrease the<br />
load by decreasing the number of OST threads.<br />
<br />
Determining the optimum number of OST threads is a process of trial and error. You<br />
may want to start with a number of OST threads equal to the number of actual disk<br />
spindles on the node. If you use RAID, subtract any dead spindles not used for<br />
actual data (e.g., 1 of N spindles for RAID5, 2 of N spindles for RAID6), and<br />
monitor the performance of clients during usual workloads. If performance is degraded,<br />
increase the thread count and see how that works until performance is degraded again<br />
or you reach satisfactory performance.<br />
<br />
==MDS Threads==<br />
There is a similar parameter for the number of MDS service threads:<br />
<br />
options mds mds_num_threads={N}<br />
<br />
At this time, no testing has been done to determine the optimal number of MDS threads. The default number varies based on the server size up to a maximum of 32. The maximum number of threads (''MDS_MAX_THREADS'') is 512.<br />
<br />
'''''Note:'''''<br />
The OSS and MDS will automatically start new service threads dynamically in response to server loading within a factor of 4. The default is calculated the same way as before.<br />
Setting the ''oss_num_threads'' or ''mds_num_threads'' module parameter disables the automatic thread creation behavior.<br />
<br />
==LNET Tunables==<br />
''Transmit and receive buffer size:''<br />
With Lustre release 1.4.7 and later, ''ksocklnd'' now has separate parameters for the transmit and receive buffers.<br />
<br />
options ksocklnd tx_buffer_size=0 rx_buffer_size=0<br />
<br />
If these parameters are left at the default (0), the system automatically tunes the transmit and receive buffer size. In almost every case, the defaults produce the best performance. Do not attempt to tune this unless you are a network expert.<br />
<br />
''irq_affinity:''<br />
By default, this parameter is on. In the normal case on an SMP system, we would like our network traffic to remain local to a single CPU. This helps to keep the processor cache warm and minimizes the impact of context switches. This is especially helpful when an SMP system has more than one network interface and ideal when the number of interfaces equals the number of CPUs.<br />
<br />
If you have an SMP platform with a single fast interface such as 10Gb Ethernet and more than two CPUs, you may see performance improve by turning this parameter off. As always, test to compare the impact.<br />
<br />
=Options for Formatting MDS and OST=<br />
The backing file systems on the MDS and OSTs are independent of each other, so the formatting parameters for them should not be the same. The size of the MDS backing file system depends solely on how many inodes you want in the total Lustre file system. It is not related to the size of the aggregate OST space.<br />
<br />
==Planning for Inodes==<br />
Every time you create a file on a Lustre file system, it consumes one inode on the MDS and one inode for each OST object that the file is striped over (normally it is based on the default stripe count option ''-c'', but this may change on a per-file basis). In ''ext3/ldiskfs'' file systems, inodes are pre-allocated, so creating a new file does not consume any of the free blocks. However, this also means that the format-time options should be conservative as it is not possible to increase the number of inodes after the file system is formatted. But it is possible to add OSTs with additional space and inodes to the file system.<br />
<br />
To be on the safe side, plan for 4KB per inode on the MDS. This is the default. For the OST, the amount of space taken by each object depends entirely upon the usage pattern of the users/applications running on the system. Lustre, by necessity, defaults to a very conservative estimate for the object size (16KB per object). You can almost always increase this for file system installations. Many Lustre file systems have average file sizes over 1MB per object.<br />
<br />
== Sizing the MDT ==<br />
When calculating the MDS size, the only important factor is the average size of files to be stored in the file system. If the average file size is, for example, 5MB and you have 100TB of usable OST space, then you need at least ''100TB * 1024GB/TB * 1024MB/GB / 5MB/inode = 20 million inodes''. We recommend that you have twice the minimum, that is, 40 million inodes in this example. At the default 4KB per inode, this works out to only 160GB of space for the MDS.<br />
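<br />
The arithmetic in the example above can be checked with a short C sketch. The function names are hypothetical, and the 4KB-per-inode constant is simply the default ratio quoted in this section:<br />

```c
#include <stdint.h>

#define MDT_BYTES_PER_INODE 4096ULL     /* default ratio quoted in the text */

/* Minimum inode count: usable OST space divided by the average file size. */
uint64_t mdt_min_inodes(uint64_t ost_bytes, uint64_t avg_file_bytes)
{
        if (avg_file_bytes == 0)
                return 0;       /* no meaningful estimate without a size */

        return ost_bytes / avg_file_bytes;
}

/* MDT space for twice the minimum inode count at 4KB per inode. */
uint64_t mdt_recommended_bytes(uint64_t ost_bytes, uint64_t avg_file_bytes)
{
        return 2 * mdt_min_inodes(ost_bytes, avg_file_bytes) *
               MDT_BYTES_PER_INODE;
}
```

With 100TB of OST space and a 5MB average file size, ''mdt_min_inodes()'' returns 20,971,520 (the "20 million" above) and ''mdt_recommended_bytes()'' returns exactly 160GB.<br />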
<br />
Conversely, if you have a very small average file size, for example 4KB, Lustre is not very efficient. This is because you consume as much space on the MDS as you are consuming on the OSTs. This is not a very common configuration for Lustre.<br />
<br />
==Overriding Default Formatting Options==<br />
To override the default formatting options for any of the Lustre backing filesystems, use the ''--mkfsoptions='backing fs options''' argument to ''mkfs.lustre'' to pass formatting options to the backing ''mkfs''. For all options to format backing ''ext3'' and ''ldiskfs'' filesystems, see the ''mke2fs(8)'' man page; this section only discusses some Lustre-specific options.<br />
<br />
===Number of Inodes for MDS===<br />
To override the inode ratio, use the option ''-i <bytes per inode>'' (for instance, ''--mkfsoptions='-i 4096''' to create one inode per 4096 bytes of file system space). Alternately, if you are specifying some absolute number of inodes, use the ''-N<number of inodes>'' option. You should not specify the ''-i'' option with an inode ratio below one inode per 1024 bytes in order to avoid unintentional mistakes. Instead, use the ''-N'' option.<br />
<br />
A 2TB MDS by default will have 512M inodes. The largest currently-supported file system size is 8TB, which would hold 2B inodes. With an MDS inode ratio of 1024 bytes per inode, a 2TB MDS would hold 2B inodes, and a 4TB MDS would hold 4B inodes, which is the maximum number of inodes currently supported by ext3.<br />
<br />
===Inode Size for MDS===<br />
Lustre uses "large" inodes on the backing file systems in order to efficiently store Lustre metadata with each file. On the MDS, each inode is at least 512 bytes in size by default, while on the OST each inode is 256 bytes in size. Lustre (or more specifically the backing ''ext3'' file system), also needs sufficient space left for other metadata like the journal (up to 400MB), bitmaps and directories. There are also a few regular files that Lustre uses to maintain cluster consistency.<br />
<br />
To specify a larger inode size, use the ''-I <inodesize>'' option. We do NOT recommend specifying a smaller-than-default inode size, as this can lead to serious performance problems; and you cannot change this parameter after formatting the file system. The inode ratio must always be larger than the inode size.<br />
<br />
===Number of Inodes for OST===<br />
For OST file systems, it is normally advantageous to take local file system usage into account. Try to minimize the number of inodes created on each OST, while keeping enough margin for potential variance in future usage. This helps in reducing the format and ''e2fsck'' time, and makes more space available for data. The current default is to create one inode per 16KB of space in the OST file system, but in many environments, this is far too many inodes for the average file size. As a good rule of thumb, the OSTs should have at least:<br />
<br />
num_ost_inodes = 4 * <num_mds_inodes> * <default_stripe_count> / <number_osts><br />
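<br />
As a rough sketch, the rule of thumb above and the equivalent ''-i'' ratio used in this section (average file size divided by four times the stripe count) can be written as two C helpers. The function names are hypothetical:<br />

```c
#include <stdint.h>

/* Rule of thumb: 4 * num_mds_inodes * default_stripe_count / number_osts. */
uint64_t ost_min_inodes(uint64_t num_mds_inodes, uint32_t default_stripe_count,
                        uint32_t number_osts)
{
        if (number_osts == 0)
                return 0;       /* no OSTs: nothing to size */

        return 4 * num_mds_inodes * default_stripe_count / number_osts;
}

/* Value for mke2fs -i: average_file_size / (number_of_stripes * 4). */
uint64_t ost_bytes_per_inode(uint64_t avg_file_bytes,
                             uint32_t number_of_stripes)
{
        if (number_of_stripes == 0)
                return 0;

        return avg_file_bytes / ((uint64_t)number_of_stripes * 4);
}
```

For a 16MB average file size and 4 stripes per file, ''ost_bytes_per_inode()'' returns 1048576, which corresponds to ''--mkfsoptions='-i 1048576'''.<br />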
<br />
You can specify the number of inodes on the OST file systems via the ''-N<num_inodes>'' option to ''--mkfsoptions''. Alternately, if you know the average file size, then you can also specify the OST inode count for the OST file systems via ''-i <average_file_size / (number_of_stripes * 4)>''. (For example, if the average file size is 16MB and there are by default 4 stripes per file then ''--mkfsoptions='-i 1048576''' would be appropriate).</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=10645Coding Guidelines2010-01-25T22:45:21Z<p>Adilger: /* Layout */</p>
<hr />
<div>== Beautiful Code == <br />
<br />
''A note from Eric Barton, our lead engineer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''req_portal'' and ''rep_portal'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
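<br />
As a hedged illustration of the assertion guidance in the list above, the sketch below asserts only on a local invariant and returns an error for bad data from the wire. The structure, constants and function name are hypothetical and are not taken from the Lustre source:<br />

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire header, for illustration only. */
struct msg_hdr {
        uint32_t mh_magic;
        uint32_t mh_len;
};

#define MSG_MAGIC   0x0BD00BD0U
#define MSG_MAX_LEN 4096U

/* Data read from the network may be corrupt: check it and return an error. */
int msg_hdr_check(const struct msg_hdr *hdr)
{
        /* Invariant: callers never pass NULL, so a violation is a local bug. */
        assert(hdr != NULL);

        if (hdr->mh_magic != MSG_MAGIC)
                return -EINVAL;
        if (hdr->mh_len > MSG_MAX_LEN)
                return -ERANGE;

        return 0;
}
```

The assertion documents something the surrounding code guarantees, while the two checks handle input that the code does not control and therefore must not crash on.<br />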
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parenthesis, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* No tabs should be used in any Lustre™, LNET or ''libcfs'' files. The exceptions are ''libsysio'' (maintained by someone else), ''ldiskfs'' and kernel patches (also part of a non-Lustre Group project).<br />
<br />
* Blocks should be indented 8 spaces.<br />
<br />
* New files should contain the following along with the license boilerplate. This will cause vim and emacs to use spaces instead of tabs for indenting. If you use a different editor, it also needs to be set to use spaces for indenting Lustre code.<br />
<pre><br />
/* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*-<br />
* vim:expandtab:shiftwidth=8:tabstop=8:<br />
*/<br />
</pre><br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
        do_sth2_1;<br />
<br />
        if (cond3)<br />
                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
        do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
        if (!cond1)<br />
                return;<br />
<br />
        do_sth1_1;<br />
<br />
        if (cond2)<br />
                func_helper(...);<br />
<br />
        do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
        if (cond1) {<br />
                do_sth1_1;<br />
                if (cond2) {<br />
                        do_sth2_1;<br />
                        if (cond3) {<br />
                                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
                        }<br />
                        do_sth2_2;<br />
                }<br />
                do_sth1_2;<br />
        }<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regexp:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
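<br />
:As a minimal sketch of the ''typedef'' rules above: the names ''lu_seq_t'' and ''struct lu_range'' are hypothetical and chosen only for illustration, not taken from the Lustre tree.<br />

```c
#include <stdint.h>

/* An integer typedef with a reason to exist: its width may change later.
 * The _t suffix marks it as a typedef wherever it appears. */
typedef uint64_t lu_seq_t;

struct lu_range {
        lu_seq_t lr_start;
        lu_seq_t lr_end;
};

/* The '*' stays visible in the declaration; there is deliberately no
 * "typedef struct lu_range *lu_range_p" hiding the pointer. */
static lu_seq_t lu_range_len(const struct lu_range *range)
{
        return range->lr_end - range->lr_start;
}
```

:A reader of ''lu_range_len()'' can see at a glance that it takes a pointer and returns a typedef'd integer.<br />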
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
ptr = malloc(size);<br />
if (ptr != NULL) {<br />
...<br />
<br />
wrong:<br />
if ((ptr = malloc(size)) != NULL) {<br />
...<br />
</pre><br />
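<br />
:The precedence risk mentioned above can be shown with a small sketch. The function names are hypothetical. The extra outer parentheses in the second function only silence the compiler warning; they do not fix the bug.<br />

```c
static int get_status(void)
{
        return 5;               /* any non-zero "error" code */
}

/* Assignment kept separate from the test: rc holds the status. */
static int status_separate(void)
{
        int rc;

        rc = get_status();
        if (rc != 0)
                return rc;

        return 0;
}

/* Inner parentheses forgotten: this is parsed as
 * rc = (get_status() != 0), so rc holds the result of the comparison. */
static int status_embedded(void)
{
        int rc;

        if ((rc = get_status() != 0))
                return rc;

        return 0;
}
```

:Both functions take the same branch, but only the first returns the actual status code, which is the kind of slip that is easy to miss in review.<br />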
<br />
* Conditional expressions read more clearly if only boolean expressions are implicit (i.e., non-boolean and pointer expressions compare explicitly with ''0'' and ''NULL'' respectively.)<br />
<pre><br />
right:<br />
if (!writing && /* not writing? */<br />
inode != NULL && /* valid inode? */<br />
ref_count == 0) /* no more references? */<br />
do_this();<br />
<br />
wrong:<br />
if (writing == 0 && /* not writing? */<br />
inode && /* valid inode? */<br />
!ref_count) /* no more references? */<br />
do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
if (a->a_field == 3 ||<br />
    ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
        do_this();<br />
<br />
wrong:<br />
if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
        do_this();<br />
<br />
wrong:<br />
if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
    (c->c_field & (BITMASK2)))))<br />
        do_this();<br />
</pre><br />
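<br />
:A common precedence error of the kind this rule guards against, as a minimal sketch with a hypothetical flag name: ''=='' binds more tightly than ''&'', so a mask test without parentheses does not test the mask at all. Compilers typically warn about the first function.<br />

```c
#define FLAG_DIRTY 0x4

/* Parsed as flags & (FLAG_DIRTY == 0), i.e. flags & 0: always false. */
static int is_clean_wrong(unsigned int flags)
{
        return flags & FLAG_DIRTY == 0;
}

/* Parentheses make the intended grouping explicit. */
static int is_clean_right(unsigned int flags)
{
        return (flags & FLAG_DIRTY) == 0;
}
```

:One pair of parentheses is enough here; adding more would start to hide the matching pairs again.<br />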
<br />
=== Lustre Guidelines ===<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each()'' followed by ''list_entry()''.<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
int *array;<br />
<br />
OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(int)); /* will break if array becomes __u64 */<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
<br />
</pre><br />
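<br />
:The same idea in a self-contained userspace sketch. It assumes the standard ''calloc()'' as a stand-in for ''OBD_ALLOC()'', and the function names are hypothetical.<br />

```c
#include <stdint.h>
#include <stdlib.h>

#define NR_ENTRIES 10

/* The element type was changed from int to a 64-bit type; the allocation
 * did not need to change because it sizes itself from *array. */
static uint64_t *alloc_entries(void)
{
        uint64_t *array;

        array = calloc(NR_ENTRIES, sizeof(*array));

        return array;           /* NULL on allocation failure */
}

/* sizeof() does not evaluate its operand, so no dereference happens. */
static size_t entries_size(void)
{
        uint64_t *array = NULL;

        return NR_ENTRIES * sizeof(*array);
}
```

:Had the allocation used ''sizeof(int)'', it would typically have become too small after the type change without any compiler diagnostic.<br />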
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
list_for_each_entry(...) {<br />
<br />
if (!condition1) {<br />
do_sth1;<br />
continue;<br />
}<br />
<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
<br />
if (!condition2)<br />
break;<br />
<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
}<br />
wrong:<br />
list_for_each_entry(...) {<br />
if (condition1) {<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
if (condition2) {<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
continue;<br />
} <br />
break;<br />
} else {<br />
do_sth1;<br />
}<br />
}<br />
</pre><br />
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column, preferably with longer declarations at the top.<br />
<pre><br />
right:<br />
int len;<br />
int count;<br />
struct inode *inode;<br />
<br />
wrong:<br />
int len, count;<br />
struct inode *inode;<br />
</pre><br />
<br />
* Variable declarations should be kept to an internal scope, if practical and reasonable, to simplify understanding of where these variables are used:<br />
<br />
<pre><br />
right:<br />
int len;<br />
<br />
if (len > 0) {<br />
int count;<br />
struct inode *inode = iget(foo);<br />
<br />
count = inode->i_size;<br />
:<br />
}<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
right:<br />
if (foo)<br />
bar();<br />
wrong:<br />
if (foo) bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
                                    longest_argument(sub_argument,<br />
                                                     foo_argument),<br />
                                    last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
    another_long_condition(very_long_argument_name,<br />
                           another_long_argument_name) ><br />
    second_long_value) {<br />
        ...<br />
<br />
wrong:<br />
<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
        longest_argument(sub_argument, foo_argument),<br />
        last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
        another_long_condition(very_long_argument_name,<br />
        another_long_argument_name) ><br />
        second_long_value) {<br />
        ...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
off = le32_to_cpu(fsd->fsd_client_start) +<br />
cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
right:<br />
a++;<br />
b |= c;<br />
d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right: <br />
do_foo(bar, baz);<br />
<br />
wrong:<br />
do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
right:<br />
for (a = 0; a < b; a++)<br />
if (a < b || a == c)<br />
while (1)<br />
wrong:<br />
for( a=0; a<b; a++ )<br />
if( a<b || a==c )<br />
while( 1 )<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
if (bar) {<br />
this();<br />
that();<br />
} else if (baz) {<br />
stuff();<br />
} else {<br />
other_stuff();<br />
}<br />
<br />
do {<br />
cow();<br />
} while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else {<br />
salmon();<br />
}<br />
<br />
wrong:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else<br />
moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a) \<br />
do { \<br />
int b = (a) + MAGIC; \<br />
do_other_stuff(b); \<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a) \<br />
do { \<br />
int b = a + MAGIC; \<br />
do_other_stuff(b); \<br />
} while (0);<br />
</pre><br />
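<br />
:Why both rules matter, as a minimal sketch with hypothetical macro names: an unparenthesized parameter changes meaning when the caller passes an expression, and the do/while (0) wrapper lets a multi-statement macro sit in an ''if''/''else'' like a function call.<br />

```c
/* Unprotected parameter: SCALE_BAD(1 + 2) expands to 1 + 2 * 10. */
#define SCALE_BAD(a)    a * 10
#define SCALE_GOOD(a)   ((a) * 10)

/* Multi-statement macro wrapped in do/while (0) with no trailing
 * semicolon, so "ADD_TWICE(...);" behaves as a single statement. */
#define ADD_TWICE(sum, a)       \
do {                            \
        (sum) += (a);           \
        (sum) += (a);           \
} while (0)

static int add_twice_if(int cond, int value)
{
        int sum = 0;

        /* With a bare { } block or bare statements in the macro, the
         * semicolon after the call would orphan the else below. */
        if (cond)
                ADD_TWICE(sum, value);
        else
                sum = -1;

        return sum;
}
```

:Note that ''ADD_TWICE()'' still evaluates its argument twice, so callers should not pass expressions with side effects.<br />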
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
if (invalid_dentry(parent)) {<br />
...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
if (d_unhashed(parent)) {<br />
#endif<br />
...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
of<br />
stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
* stuff */<br />
# endif /* HAVE_FEATURE */<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
* as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
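<br />
:What "correctly swabbed" can look like, as a sketch only. The structure, magic values and function names below are hypothetical and are not the definitions in lustre_idl.h; the example assumes the receiver detects a foreign byte order from a byte-reversed magic number.<br />

```c
#include <stdint.h>

/* Hypothetical on-wire header, for illustration only */
struct demo_wire_hdr {
        uint32_t dwh_magic;
        uint32_t dwh_len;
};

#define DEMO_MAGIC              0x12345678u
#define DEMO_MAGIC_SWABBED      0x78563412u

static uint32_t demo_swab32(uint32_t v)
{
        return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
               ((v & 0x00ff0000u) >> 8) | ((v & 0xff000000u) >> 24);
}

/* Returns 0 on success, -1 if the magic is not recognized.  Data from
 * the network is validated with an error return, never asserted on. */
static int demo_unpack_hdr(struct demo_wire_hdr *hdr)
{
        if (hdr->dwh_magic == DEMO_MAGIC)
                return 0;       /* sender used our byte order */

        if (hdr->dwh_magic != DEMO_MAGIC_SWABBED)
                return -1;

        hdr->dwh_magic = demo_swab32(hdr->dwh_magic);
        hdr->dwh_len = demo_swab32(hdr->dwh_len);

        return 0;
}
```

:Keeping the structure and its swab routine in one header means every field added later has an obvious place to be swabbed.<br />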
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64 LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t LPSZ (or cast to int and use %u / %d)<br />
__u32/int %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long %llu/%llx/%lld<br />
loff_t %lld after a cast to long long (unfortunately)<br />
</pre><br />
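<br />
:The cast-based rows of the table can be sketched in portable userspace C as follows. ''demo_loff_t'' is a hypothetical stand-in for the kernel's ''loff_t'', and the Lustre-specific LPU64/LPX64/LPD64 macros are deliberately not used because their definitions are not shown here.<br />

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's loff_t, for illustration only */
typedef int64_t demo_loff_t;

/* 64-bit values are cast to (unsigned) long long so that one format
 * string is correct whether int64_t is long or long long underneath. */
static int demo_format_io(char *buf, size_t len, demo_loff_t off,
                          uint64_t id, unsigned int count)
{
        return snprintf(buf, len, "off=%lld id=%#llx count=%u",
                        (long long)off, (unsigned long long)id, count);
}
```

:Without the casts, the format specifier would have to differ between 32-bit and 64-bit builds.<br />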
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
[ac_cv_emxos2=yes],<br />
[ac_cv_emxos2=no])])<br />
</pre><br />
:or even<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
[ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
[return __EMX__;])],<br />
[ac_cv_emxos2=yes],<br />
[ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=10644Coding Guidelines2010-01-25T22:44:06Z<p>Adilger: /* Layout */</p>
<hr />
<div>== Beautiful Code == <br />
<br />
''A note from Eric Barton, our lead engineer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''req_portal'' and ''rep_portal'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
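<br />
The split between assertions and input validation described above can be sketched as follows. This is a minimal illustration that assumes the standard C ''assert()'' as a stand-in for Lustre's own assertion macros; the function names and the length limit are hypothetical.<br />

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_NAME_LEN 255

/* Length field as it arrived from the network or disk: untrusted, so it
 * is checked and an error is returned.  An assertion here would let a
 * corrupt packet crash the node. */
static int demo_check_name_len(unsigned int wire_len)
{
        if (wire_len == 0 || wire_len > DEMO_MAX_NAME_LEN)
                return -EINVAL;

        return 0;
}

/* Internal helper: callers must already have validated the length, so
 * a violation is a programming error.  The assertion documents the
 * invariant and trips where it is broken. */
static size_t demo_name_buf_size(unsigned int len)
{
        assert(len > 0 && len <= DEMO_MAX_NAME_LEN);

        return (size_t)len + 1;         /* room for the trailing NUL */
}
```

The first function is the boundary where outside data enters; everything behind it may rely on the invariant and assert it.<br />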
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
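<br />
One concrete cost of the clever swap, as a minimal sketch with hypothetical function names: when both arguments refer to the same object, the XOR version zeroes it, while the plain version leaves it intact.<br />

```c
static void swap_plain(int *a, int *b)
{
        int t;

        t = *a;
        *a = *b;
        *b = t;
}

/* Works for distinct objects, but if a == b the first statement
 * computes *a ^ *a and the value is lost. */
static void swap_xor(int *a, int *b)
{
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
}
```

A reader has to know about the aliasing case to use ''swap_xor()'' safely; nothing about ''swap_plain()'' needs explaining.<br />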
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parentheses, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* No tabs should be used in any Lustre™, LNET or ''libcfs'' files. The exceptions are ''libsysio'' (maintained by someone else), ''ldiskfs'' and kernel patches (also part of a non-Lustre Group project).<br />
<br />
* Blocks should be indented 8 spaces.<br />
<br />
* New files should contain the following along with the license boilerplate. This will cause vim and emacs to use spaces instead of tabs for indenting. If you use a different editor, it also needs to be set to use spaces for indenting Lustre code.<br />
<pre><br />
/* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*-<br />
* vim:expandtab:shiftwidth=8:tabstop=8:<br />
*/<br />
</pre><br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
        do_sth2_1;<br />
<br />
        if (cond3)<br />
                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
        do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
        if (!cond1)<br />
                return;<br />
<br />
        do_sth1_1;<br />
<br />
        if (cond2)<br />
                func_helper(...);<br />
<br />
        do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
        if (cond1) {<br />
                do_sth1_1;<br />
                if (cond2) {<br />
                        do_sth2_1;<br />
                        if (cond3) {<br />
                                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
                        }<br />
                        do_sth2_2;<br />
                }<br />
                do_sth1_2;<br />
        }<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regexp:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
ptr = malloc(size);<br />
if (ptr != NULL) {<br />
...<br />
<br />
wrong:<br />
if ((ptr = malloc(size)) != NULL) {<br />
...<br />
</pre><br />
<br />
* Conditional expressions read more clearly if only boolean expressions are implicit (i.e., non-boolean and pointer expressions compare explicitly with ''0'' and ''NULL'' respectively.)<br />
<pre><br />
right:<br />
if (!writing && /* not writing? */<br />
inode != NULL && /* valid inode? */<br />
ref_count == 0) /* no more references? */<br />
do_this();<br />
<br />
wrong:<br />
if (writing == 0 && /* not writing? */<br />
inode && /* valid inode? */<br />
!ref_count) /* no more references? */<br />
do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
if (a->a_field == 3 ||<br />
    ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
        do_this();<br />
<br />
wrong:<br />
if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
        do_this();<br />
<br />
wrong:<br />
if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
    (c->c_field & (BITMASK2)))))<br />
        do_this();<br />
</pre><br />
<br />
=== Lustre Guidelines ===<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each()'' followed by ''list_entry()''.<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
int *array;<br />
<br />
OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(int)); /* will break if array becomes __u64 */<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
<br />
</pre><br />
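<br />
:The first guideline in this section, preferring ''list_for_each_entry()'', can be tried in userspace with the sketch below. The list macros here are simplified stand-ins written for this example (they rely on the GCC/Clang ''__typeof__'' extension); real code should use the kernel's own list.h. ''struct demo_item'' and the ''demo_'' functions are hypothetical.<br />

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel list helpers, illustration only */
struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

#define list_entry(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)                          \
        for (pos = list_entry((head)->next, __typeof__(*pos), member);  \
             &pos->member != (head);                                    \
             pos = list_entry(pos->member.next, __typeof__(*pos), member))

struct demo_item {
        int              di_value;
        struct list_head di_link;
};

static void demo_list_add_tail(struct list_head *entry,
                               struct list_head *head)
{
        entry->prev = head->prev;
        entry->next = head;
        head->prev->next = entry;
        head->prev = entry;
}

/* The iterator is already the containing structure, so the loop body
 * needs no separate list_entry() call or cursor variable. */
static int demo_sum(struct list_head *head)
{
        struct demo_item *item;
        int               sum = 0;

        list_for_each_entry(item, head, di_link)
                sum += item->di_value;

        return sum;
}
```

:With ''list_for_each()'' the loop would need an extra ''struct list_head *'' cursor and a ''list_entry()'' conversion on every iteration.<br />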
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
list_for_each_entry(...) {<br />
<br />
if (!condition1) {<br />
do_sth1;<br />
continue;<br />
}<br />
<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
<br />
if (!condition2)<br />
break;<br />
<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
}<br />
wrong:<br />
list_for_each_entry(...) {<br />
if (condition1) {<br />
do_sth2_1;<br />
do_sth2_2;<br />
...<br />
do_sth2_N;<br />
if (condition2) {<br />
do_sth3_1;<br />
do_sth3_2;<br />
...<br />
do_sth3_N;<br />
continue;<br />
} <br />
break;<br />
} else {<br />
do_sth1;<br />
}<br />
}<br />
</pre><br />
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column, preferably with longer declarations at the top.<br />
<pre><br />
right:<br />
int len;<br />
int count;<br />
struct inode *inode;<br />
<br />
wrong:<br />
int len, count;<br />
struct inode *inode;<br />
</pre><br />
<br />
* Variable declarations should be kept to an internal scope, if practical and reasonable, to simplify understanding of where these variables are used:<br />
<br />
<pre><br />
right:<br />
int len;<br />
<br />
if (len > 0) {<br />
int count;<br />
struct inode *inode = iget(foo);<br />
<br />
count = inode->i_size;<br />
:<br />
}<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
if (foo)<br />
bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
                                    longest_argument(sub_argument,<br />
                                                     foo_argument),<br />
                                    last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
    another_long_condition(very_long_argument_name,<br />
                           another_long_argument_name) ><br />
    second_long_value) {<br />
        ...<br />
<br />
wrong:<br />
<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
        longest_argument(sub_argument, foo_argument),<br />
        last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
        another_long_condition(very_long_argument_name,<br />
        another_long_argument_name) ><br />
        second_long_value) {<br />
        ...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
off = le32_to_cpu(fsd->fsd_client_start) +<br />
cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
right:<br />
a++;<br />
b |= c;<br />
d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right: <br />
do_foo(bar, baz);<br />
<br />
wrong:<br />
do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
right:<br />
for (a = 0; a < b; a++)<br />
if (a < b || a == c)<br />
while (1)<br />
wrong:<br />
for( a=0; a<b; a++ )<br />
if( a<b || a==c )<br />
while( 1 )<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
if (bar) {<br />
this();<br />
that();<br />
} else if (baz) {<br />
stuff();<br />
} else {<br />
other_stuff();<br />
}<br />
<br />
do {<br />
cow();<br />
} while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else {<br />
salmon();<br />
}<br />
<br />
wrong:<br />
if (foo) {<br />
bar();<br />
baz();<br />
} else<br />
moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a) \<br />
do { \<br />
int b = (a) + MAGIC; \<br />
do_other_stuff(b); \<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a) \<br />
do { \<br />
int b = a + MAGIC; \<br />
do_other_stuff(b); \<br />
} while (0);<br />
</pre><br />
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
        return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
        return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
        if (invalid_dentry(parent)) {<br />
                ...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
        if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
        if (d_unhashed(parent)) {<br />
#endif<br />
                ...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
of<br />
stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
* stuff */<br />
# endif /* HAVE_FEATURE */<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
 * as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64                   LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t                  LPSZ (or cast to int and use %u / %d)<br />
__u32/int               %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long    %llu/%llx/%lld<br />
loff_t                  %lld after a cast to long long (unfortunately)<br />
</pre><br />
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
                   [ac_cv_emxos2=yes],<br />
                   [ac_cv_emxos2=no])])<br />
</pre><br />
:or even<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
               [ac_cv_emxos2],<br />
               [AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
                                                   [return __EMX__;])],<br />
                                  [ac_cv_emxos2=yes],<br />
                                  [ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Coding_Guidelines&diff=10643Coding Guidelines2010-01-25T22:36:48Z<p>Adilger: /* C Language Features */</p>
<hr />
<div>== Beautiful Code == <br />
<br />
''A note from Eric Barton, our lead engineer:''<br />
<br />
More important than the physical layout of code (which is covered in detail below) is the idea that the code should be ''beautiful'' to read.<br />
<br />
What makes code beautiful to me? Fundamentally, it's readability and obviousness. The code must not have secrets but should flow easily, pleasurably and ''accurately'' off the page and into the mind of the reader.<br />
<br />
How do I think beautiful code is written? Like this...<br />
<br />
* The author must be confident and knowledgeable and proud of her work. She must understand what the code should do, the environment it must work in, all the combinations of inputs, all the valid outputs, all the possible races and all the reachable states. She must [http://en.wikipedia.org/wiki/Grok grok] it.<br />
<br />
* Names must be well chosen. The meaning a human reader attaches to a name can be orthogonal to what the compiler does with it, so it's just as easy to mislead as it is to inform. ''[http://en.wikipedia.org/wiki/Does_what_it_says_on_the_tin "Does exactly what it says on the tin"]'' is a popular UK English expression describing something that does ''exactly'' what it tells you it's going to do, no more and no less. For example, if I open a tin labeled "soap", I expect the contents to help me wash and maybe even smell nice. If it's no good at removing dirt, I'll be disappointed. If it removes the dirt but burns off a layer of skin with it, I'll be positively upset. The name of a procedure, a variable or a structure member should tell you something informative about the entity without misleading - just "what it says on the tin".<br />
<br />
* Names must be well chosen. Local, temporary variables can almost always remain relatively short and anonymous, while names in global scope must be unique. In general, the wider the context you expect to use the name in, the more unique and informative the name should be. Don't be scared of long names if they help to ''make_the_code_clearer'', but ''do_not_let_things_get_out_of_hand'' either - we don't write COBOL. Related names should be obvious, unambiguous and avoid naming conflicts with other unrelated names, e.g. by using a consistent prefix. This applies to all API procedures (if not all procedures period) within a given subsystem. Similarly, unique member names for global structures, using a prefix to identify the parent structure type, helps readability.<br />
<br />
* Names must be well chosen. Don't choose names that are easily confused - especially not if the compiler can't even tell the difference when you make a spelling mistake. ''i'' and ''j'' aren't the worst example - ''req_portal'' and ''rep_portal'' are much worse (and taken from our own code!!!).<br />
<br />
* Names must be well chosen. I can't emphasize this issue enough - I hope you get the point.<br />
<br />
* Assertions must be used intelligently. They combine the roles of ''active comment'' and ''software fuse''. As an ''active comment'' they tell you something about the program that you can trust more than a comment. And as a ''software fuse'', they provide fault isolation between subsystems by letting you know when and where invariant assumptions are violated. Overuse must be avoided - it hurts performance without helping readability - and any other use is just plain wrong. For example, assertions must '''never''' be used to validate data read from disk or the network. Network and disk hardware ''does'' fail and Lustre has to handle that - it can't just crash. The same goes for user input. Checking data copied in from userspace with assertions just opens the door for a denial of service attack.<br />
<br />
* Formatting and indentation rules should be followed intelligently. The visual layout of the code on the page should lend itself to being read easily and accurately - it just looks clean and good.<br />
** Separate "ideas" should be separated clearly in the code layout using blank lines that group related statements and separate unrelated statements.<br />
** Procedures should not ramble on. You must be able to take in the meaning of a procedure without scrolling past page after page of code or parsing deeply nested conditionals and loops. The 80-column rule is there for a reason.<br />
** Declarations are easier to refer to while scanning the code if placed in a block locally to, but visually separate from, the code that uses them. Readability is further enhanced by limiting declarations to one per line and aligning types and names vertically.<br />
** Parameters in multi-line procedure calls should be aligned so that they are visually contained by their brackets.<br />
** Brackets should be used in complex expressions to make operator precedence clear.<br />
** Conditional boolean (''if (expr)''), scalar (''if (val != 0)'') and pointer (''if (ptr != NULL)'') expressions should be written consistently.<br />
** Formatting and indentation rules should not be followed slavishly. If you're faced with either breaking the 80-chars-per-line rule or the parameter indentation rule or creating an obscure helper function, then the 80-chars-per-line rule might have to suffer. The overriding consideration is how the code reads.<br />
<br />
I could go on, but I hope you get the idea. Just think about the poor reader when you're writing, and whether your code will convey its meaning naturally, quickly and accurately, without room for misinterpretation. <br />
<br />
I didn't mention ''clever'' as a feature of beautiful code because it's only one step from ''clever'' to ''tricky'' - consider...<br />
<br />
t = a; a = b; b = t; /* dumb swap */<br />
<br />
a ^= b; b ^= a; a ^= b; /* clever swap */<br />
<br />
You could feel quite pleased that the clever swap avoids the need for a local temporary variable - but is that such a big deal compared with how quickly, easily and accurately the reader will read it? This is a very minor example which can almost be excused because the "cleverness" is confined to a tiny part of the code. But when ''clever'' code gets spread out, it becomes much harder to modify without adding defects. You can only work on code without screwing up if you understand the code ''and'' the environment it works in completely. Or to put it more succinctly...<br />
<br />
:''Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.'' - [http://en.wikipedia.org/wiki/Brian_Kernighan Brian W. Kernighan]<br />
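<br />
The clever swap also carries a concrete hazard, shown here as a minimal sketch (the helper names ''swap_dumb()'' and ''swap_clever()'' are invented for this illustration): if both pointers refer to the same object, the XOR version zeroes the value, while the version with a temporary leaves it untouched.<br />

```c
/* Hypothetical helpers, named for this example only. */
void swap_dumb(int *a, int *b)
{
        int t = *a;

        *a = *b;
        *b = t;
}

void swap_clever(int *a, int *b)
{
        /* If a == b, the first XOR clears the shared value for good. */
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
}
```

Calling ''swap_clever(&z, &z)'' leaves ''z'' at 0 whatever it held before; ''swap_dumb(&z, &z)'' does not change it.<br />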
<br />
IMHO, beautiful code helps code quality because it improves communication between the code author and the code reader. Since everyone maintaining and developing the code is a code reader as well as a code author, the quality of this communication can lead either to a virtuous circle of improving quality, or a vicious circle of degrading quality. You, dear reader, will determine which.<br />
<br />
----<br />
<br />
== Style and Formatting Guidelines ==<br />
<br />
All of our rules for formatting, wrapping, parentheses, brace placement, etc., are originally derived from the [http://www.kernel.org/doc/Documentation/CodingStyle Linux kernel rules], which are basically K&R style.<br />
<br />
=== Whitespace ===<br />
<br />
Whitespace gets its own section because unnecessary whitespace changes can cause spurious merge conflicts when code is landed and updated in a distributed development environment. Please ensure that you comply with the guidelines in this section to avoid these issues. We've included default formatting rules for emacs and vim to help make it easier.<br />
<br />
* No tabs should be used in any Lustre™, LNET or ''libcfs'' files. The exceptions are ''libsysio'' (maintained by someone else), ''ldiskfs'' and kernel patches (also part of a non-Lustre Group project).<br />
<br />
* Blocks should be indented 8 spaces.<br />
<br />
* New files should contain the following along with the license boilerplate. This will cause vim and emacs to use spaces instead of tabs for indenting. If you use a different editor, it also needs to be set to use spaces for indenting Lustre code.<br />
<pre><br />
/* -*- mode: c; c-basic-offset: 8; indent-tabs-mode: nil; -*-<br />
* vim:expandtab:shiftwidth=8:tabstop=8:<br />
*/<br />
</pre><br />
<br />
* All lines should wrap at 80 characters. If it's getting too hard to wrap at 80 characters, you probably need to rearrange conditional order or break it up into more functions.<br />
<pre><br />
right:<br />
<br />
void func_helper(...)<br />
{<br />
        do_sth2_1;<br />
<br />
        if (cond3)<br />
                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
<br />
        do_sth2_2;<br />
}<br />
<br />
void func(...)<br />
{<br />
        if (!cond1)<br />
                return;<br />
<br />
        do_sth1_1;<br />
<br />
        if (cond2)<br />
                func_helper(...);<br />
<br />
        do_sth1_2;<br />
}<br />
<br />
wrong:<br />
<br />
void func(...)<br />
{<br />
        if (cond1) {<br />
                do_sth1_1;<br />
                if (cond2) {<br />
                        do_sth2_1;<br />
                        if (cond3) {<br />
                                do_sth_which_needs_a_very_long_line_to_read_clearly;<br />
                        }<br />
                        do_sth2_2;<br />
                }<br />
                do_sth1_2;<br />
        }<br />
}<br />
<br />
</pre><br />
<br />
* Do not include spaces or tabs on blank lines or at the end of lines. Please ensure you remove all instances of these in any [[Submitting Patches|patches you submit to Bugzilla]]. You can find them with grep or in vim using the following regular expression:<br />
<pre><br />
/[ \t]$/<br />
</pre><br />
<br />
:Alternatively, if you use vim, you can put this line in your vimrc file, which will highlight whitespace at the end of lines and spaces followed by tabs in indentation (only works for C/C++ files):<br />
<pre><br />
let c_space_errors=1<br />
</pre><br />
<br />
:Or you can use this command, which will make tabs and whitespace at the end of lines visible for all files (but a bit more discreetly):<br />
<pre><br />
set list listchars=tab:>\ ,trail:$<br />
</pre><br />
<br />
:In emacs, you can use (whitespace-mode) or (whitespace-visual-mode) depending on the version. You could also consider using (flyspell-prog-mode).<br />
<br />
=== C Language Features ===<br />
<br />
* Don't use ''inline'' unless you're doing something so performance critical that the function call overhead will make a difference -- in other words: almost never. It makes debugging harder and overuse can actually hurt performance by causing instruction cache or stack overflow.<br />
<br />
* Use ''typedef'' carefully...<br />
** Do not create a new integer ''typedef'' without a good reason.<br />
** Always postfix ''typedef'' names with ''_t'' so that they can be identified clearly in the code.<br />
** ''Never'' ''typedef'' pointers. The ''*'' makes C pointer declarations obvious. Hiding it inside a ''typedef'' just obfuscates the code.<br />
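<br />
:A minimal sketch of these ''typedef'' rules. All names here (''item_count_t'', ''struct item'', ''item_reset()'') are made up for the example and are not Lustre identifiers, and the sketch assumes there is a good reason for a distinct counter type.<br />

```c
#include <stddef.h>

/* A scalar typedef carries the _t postfix (hypothetical name). */
typedef unsigned int item_count_t;

struct item {
        item_count_t  it_count;
        struct item  *it_next;  /* the '*' is visible: clearly a pointer */
};

/* Discouraged: "typedef struct item *item_ptr_t;" would hide the fact
 * that callers share, and can modify, the same object. */

/* The explicit '*' shows that the caller's item is modified in place. */
void item_reset(struct item *it)
{
        it->it_count = 0;
        it->it_next = NULL;
}
```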
<br />
* Do not embed assignments inside boolean expressions. Although this can make the code more concise, it doesn't necessarily make it more elegant and you increase the risk of confusing "=" with "==" or getting operator precedence wrong if you skimp on brackets. It's even easier to make mistakes when reading the code, so it's much safer simply to avoid it altogether.<br />
<pre><br />
right:<br />
ptr = malloc(size);<br />
if (ptr != NULL) {<br />
        ...<br />
<br />
wrong:<br />
if ((ptr = malloc(size)) != NULL) {<br />
        ...<br />
</pre><br />
<br />
* Conditional expressions read more clearly if only boolean expressions are implicit (i.e., non-boolean and pointer expressions compare explicitly with ''0'' and ''NULL'' respectively.)<br />
<pre><br />
right:<br />
if (!writing &&         /* not writing? */<br />
    inode != NULL &&    /* valid inode? */<br />
    ref_count == 0)     /* no more references? */<br />
        do_this();<br />
<br />
wrong:<br />
if (writing == 0 &&     /* not writing? */<br />
    inode &&            /* valid inode? */<br />
    !ref_count)         /* no more references? */<br />
        do_this();<br />
</pre><br />
<br />
* Use parentheses to help readability and reduce the chance of operator precedence errors, but not so heavily that it is difficult to determine which parentheses are a matched pair.<br />
<pre><br />
right:<br />
if (a->a_field == 3 ||<br />
    ((b->b_field & BITMASK1) && (c->c_field & BITMASK2)))<br />
        do_this();<br />
<br />
wrong:<br />
if (a->a_field == 3 || b->b_field & BITMASK1 && c->c_field & BITMASK2)<br />
        do_this();<br />
<br />
wrong:<br />
if (((a->a_field == 3) || ((b->b_field & (BITMASK1)) &&<br />
    (c->c_field & (BITMASK2)))))<br />
        do_this();<br />
</pre><br />
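<br />
:A classic precedence trap that brackets prevent, as a minimal sketch: ''=='' binds more tightly than ''&'', so ''flags & FLAG_DIRTY == 0'' does not test whether the bit is clear. ''FLAG_DIRTY'' and both function names are hypothetical; the second function spells out how the compiler groups the unbracketed expression.<br />

```c
#define FLAG_DIRTY 0x4          /* hypothetical flag bit */

/* Intended test: is FLAG_DIRTY clear in flags? */
int flag_is_clear(unsigned int flags)
{
        return (flags & FLAG_DIRTY) == 0;
}

/* How "flags & FLAG_DIRTY == 0" is actually grouped: the comparison
 * is evaluated first, giving flags & 0, which is always 0. */
int flag_is_clear_as_parsed(unsigned int flags)
{
        return flags & (FLAG_DIRTY == 0);
}
```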
<br />
=== Lustre Guidelines ===<br />
* Use ''list_for_each_entry()'' instead of ''list_for_each'' followed by ''list_entry''<br />
* When using ''sizeof()'' it should be used on the variable itself, rather than specifying the type of the variable, so that if the variable changes type/size then ''sizeof()'' will be correct:<br />
<pre><br />
right:<br />
int *array;<br />
<br />
OBD_ALLOC(array, 10 * sizeof(*array));<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(int)); /* will break if array becomes __u64 */<br />
<br />
wrong:<br />
OBD_ALLOC(array, 10 * sizeof(array)); /* This is the pointer size */<br />
<br />
</pre><br />
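<br />
:The ''list_for_each_entry()'' guideline above can be sketched as follows. The list macros below are simplified userspace stand-ins written for this example (they rely on the GCC/Clang ''__typeof__'' extension) and are not the kernel's ''list.h''; ''struct job'' and the two ''job_total'' functions are hypothetical.<br />

```c
#include <stddef.h>

/* Userspace stand-ins for the kernel's list.h, for illustration only. */
struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

#define list_entry(ptr, type, member)                                   \
        ((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each(pos, head)                                        \
        for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

#define list_for_each_entry(pos, head, member)                          \
        for ((pos) = list_entry((head)->next, __typeof__(*(pos)),       \
                                member);                                \
             &(pos)->member != (head);                                  \
             (pos) = list_entry((pos)->member.next, __typeof__(*(pos)), \
                                member))

void INIT_LIST_HEAD(struct list_head *head)
{
        head->next = head;
        head->prev = head;
}

void list_add_tail(struct list_head *item, struct list_head *head)
{
        item->prev = head->prev;
        item->next = head;
        head->prev->next = item;
        head->prev = item;
}

struct job {
        int              j_weight;
        struct list_head j_link;
};

/* Preferred: the iterator is already the containing structure. */
int job_total(struct list_head *jobs)
{
        struct job *job;
        int         total = 0;

        list_for_each_entry(job, jobs, j_link)
                total += job->j_weight;

        return total;
}

/* Discouraged: two steps and an extra cursor variable. */
int job_total_verbose(struct list_head *jobs)
{
        struct list_head *pos;
        struct job       *job;
        int               total = 0;

        list_for_each(pos, jobs) {
                job = list_entry(pos, struct job, j_link);
                total += job->j_weight;
        }

        return total;
}
```

:Both functions return the same sum; the first simply has less to get wrong.<br />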
<br />
=== Layout ===<br />
<br />
* Code can be much more readable if the simpler actions are taken first in a set of tests. Re-ordering conditions like this also eliminates excessive nesting.<br />
<pre><br />
right:<br />
list_for_each_entry(...) {<br />
<br />
        if (!condition1) {<br />
                do_sth1;<br />
                continue;<br />
        }<br />
<br />
        do_sth2_1;<br />
        do_sth2_2;<br />
        ...<br />
        do_sth2_N;<br />
<br />
        if (!condition2)<br />
                break;<br />
<br />
        do_sth3_1;<br />
        do_sth3_2;<br />
        ...<br />
        do_sth3_N;<br />
}<br />
<br />
wrong:<br />
list_for_each_entry(...) {<br />
        if (condition1) {<br />
                do_sth2_1;<br />
                do_sth2_2;<br />
                ...<br />
                do_sth2_N;<br />
                if (condition2) {<br />
                        do_sth3_1;<br />
                        do_sth3_2;<br />
                        ...<br />
                        do_sth3_N;<br />
                        continue;<br />
                }<br />
                break;<br />
        } else {<br />
                do_sth1;<br />
        }<br />
}<br />
</pre><br />
<br />
* Variables should be declared one per line, type and name, even if there are multiple variables of the same type. For maximum readability, the names should be aligned on the same column.<br />
<pre><br />
right:<br />
int           len;<br />
int           count;<br />
struct inode *inode;<br />
<br />
wrong:<br />
int len, count;<br />
struct inode *inode;<br />
</pre><br />
<br />
* Even for short conditionals, the operation should be on a separate line:<br />
<pre><br />
if (foo)<br />
        bar();<br />
</pre><br />
<br />
* When you wrap a line containing parentheses, start the next line after the opening parenthesis so that the expression or argument is visually bracketed.<br />
<pre><br />
right:<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
                                    longest_argument(sub_argument,<br />
                                                     foo_argument),<br />
                                    last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
    another_long_condition(very_long_argument_name,<br />
                           another_long_argument_name) ><br />
    second_long_value) {<br />
        ...<br />
<br />
wrong:<br />
<br />
variable = do_something_complicated(long_argument, longer_argument,<br />
        longest_argument(sub_argument, foo_argument),<br />
        last_argument);<br />
<br />
if (some_long_condition(arg1, arg2, arg3) < some_long_value &&<br />
        another_long_condition(very_long_argument_name,<br />
                another_long_argument_name) ><br />
        second_long_value) {<br />
        ...<br />
</pre><br />
<br />
* If you're wrapping an expression, put the operator at the end of the line. If there are no parentheses to which to align the start of the next line, just indent 8 more spaces.<br />
<pre><br />
off = le32_to_cpu(fsd->fsd_client_start) +<br />
        cl_idx * le16_to_cpu(fsd->fsd_client_size);<br />
</pre><br />
<br />
* Binary and ternary (but not unary) operators should be separated from their arguments by one space.<br />
<pre><br />
a++;<br />
b |= c;<br />
d = (f > g) ? 0 : 1;<br />
</pre><br />
<br />
* Function calls should be nestled against the parentheses, the parentheses should crowd the arguments, and one space should appear after commas:<br />
<pre><br />
right: <br />
do_foo(bar, baz);<br />
<br />
wrong:<br />
do_foo ( bar,baz );<br />
</pre><br />
<br />
* Put a space between ''if'', ''for'', ''while'' etc. and the following parenthesis. Put a space after each semicolon in a ''for'' statement.<br />
<pre><br />
for (a = 0; a < b; a++)<br />
if (a < b || a == c)<br />
while (1)<br />
</pre><br />
<br />
* Opening braces should be on the same line as the line that introduces the block, except for function definitions. Bare closing braces (i.e. not ''else'' or ''while'' in do/while) get their own line.<br />
<pre><br />
int foo(void)<br />
{<br />
        if (bar) {<br />
                this();<br />
                that();<br />
        } else if (baz) {<br />
                stuff();<br />
        } else {<br />
                other_stuff();<br />
        }<br />
<br />
        do {<br />
                cow();<br />
        } while (condition);<br />
}<br />
</pre><br />
<br />
* If one part of a compound ''if'' block has braces, all should.<br />
<pre><br />
right:<br />
if (foo) {<br />
        bar();<br />
        baz();<br />
} else {<br />
        salmon();<br />
}<br />
<br />
wrong:<br />
if (foo) {<br />
        bar();<br />
        baz();<br />
} else<br />
        moose();<br />
</pre><br />
<br />
* When you define a macro, protect callers by placing parentheses round every parameter reference in the body. Line up the backslashes of multi-line macros to help readability. Use a do/while (0) block with ''no'' trailing semicolon to ensure multi-statement macros are syntactically equivalent to procedure calls.<br />
<pre><br />
/* right */<br />
#define DO_STUFF(a)                     \<br />
do {                                    \<br />
        int b = (a) + MAGIC;            \<br />
        do_other_stuff(b);              \<br />
} while (0)<br />
<br />
/* wrong */<br />
#define DO_STUFF(a) \<br />
do { \<br />
        int b = a + MAGIC; \<br />
        do_other_stuff(b); \<br />
} while (0);<br />
</pre><br />
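<br />
:Why each macro rule matters, as a minimal sketch. The macro and function names (''SCALE_BAD'', ''SCALE'', ''RESET_PAIR'', ''reset_if_negative()'') are invented for this example.<br />

```c
/* Unbracketed parameter: SCALE_BAD(1 + 2) expands to (1 + 2 * 10). */
#define SCALE_BAD(x)    (x * 10)
#define SCALE(x)        ((x) * 10)

/* do/while (0) without a trailing semicolon acts as one statement,
 * so the macro can sit between an unbraced if and its else. */
#define RESET_PAIR(a, b)                \
do {                                    \
        (a) = 0;                        \
        (b) = 0;                        \
} while (0)

int reset_if_negative(int val)
{
        int lo = val;
        int hi = val;

        if (val < 0)
                RESET_PAIR(lo, hi);
        else
                hi = SCALE(val);

        return lo + hi;
}
```

:If ''RESET_PAIR'' ended in ''while (0);'', the caller's own semicolon would add an empty statement before ''else'' and the function would no longer compile.<br />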
<br />
* If you write conditionally compiled code in a procedure body, make sure you do not create unbalanced braces, quotes, etc. This really confuses editors that navigate expressions or use fonts to highlight language features. It can often be much cleaner to put the conditionally compiled code in its own helper function which, by good choice of name, documents itself too.<br />
<pre><br />
/* right */<br />
static inline int invalid_dentry(struct dentry *d)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
        return d->d_flags & DCACHE_LUSTRE_INVALID;<br />
#else<br />
        return d_unhashed(d);<br />
#endif<br />
}<br />
<br />
int do_stuff(struct dentry *parent)<br />
{<br />
        if (invalid_dentry(parent)) {<br />
                ...<br />
<br />
/* wrong */<br />
int do_stuff(struct dentry *parent)<br />
{<br />
#ifdef DCACHE_LUSTRE_INVALID<br />
        if (parent->d_flags & DCACHE_LUSTRE_INVALID) {<br />
#else<br />
        if (d_unhashed(parent)) {<br />
#endif<br />
                ...<br />
</pre><br />
<br />
* If you nest preprocessor commands, use spaces to visually delineate:<br />
<pre><br />
#ifdef __KERNEL__<br />
# include <goose><br />
# define MOOSE steak<br />
#else<br />
# include <mutton><br />
# define MOOSE prancing<br />
#endif<br />
</pre><br />
<br />
* For very long #ifdefs, include the conditional with each #endif to make it readable:<br />
<pre><br />
#ifdef __KERNEL__<br />
# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,5,0)<br />
/* lots<br />
of<br />
stuff */<br />
# endif /* KERNEL_VERSION(2,5,0) */<br />
#else /* !__KERNEL__ */<br />
# if HAVE_FEATURE<br />
/* more<br />
* stuff */<br />
# endif /* HAVE_FEATURE */<br />
#endif /* __KERNEL__ */<br />
</pre><br />
<br />
* Comments should have the leading '/*' on the same line as the comment and the trailing '*/' at the end of the last comment line. Intermediate lines should start with a '*' aligned with the '*' on the first line:<br />
<pre><br />
/* This is a short comment */<br />
<br />
/* This is a multi-line comment. I wish the line would wrap already,<br />
 * as I don't have much to write about. */<br />
</pre><br />
<br />
* Function declarations absolutely should NOT go into .c files, unless they are forward declarations for static functions that can't otherwise be moved before the caller. Instead, the declaration should go into the most "local" header available (preferably *_internal.h for a given piece of code).<br />
<br />
* Structure and constant declarations should not be declared in multiple places. Put the struct into the most "local" header possible. If it is something that is passed over the wire, it needs to go into lustre_idl.h and needs to be correctly swabbed when the RPC message is unpacked.<br />
<br />
* The types and printf/printk formats used by Lustre code are:<br />
<pre><br />
__u64                   LPU64/LPX64/LPD64 (unsigned, hex, signed)<br />
size_t                  LPSZ (or cast to int and use %u / %d)<br />
__u32/int               %u/%x/%d (unsigned, hex, signed)<br />
(unsigned) long long    %llu/%llx/%lld<br />
loff_t                  %lld after a cast to long long (unfortunately)<br />
</pre><br />
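<br />
:The cast-then-format pattern from the table, as a minimal sketch in standard C only. Lustre's ''LPU64''/''LPX64'' macros are not defined here; ''uint64_t'' stands in for ''__u64'', ''off_t'' for ''loff_t'', and ''format_extent()'' is a hypothetical helper.<br />

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Format an object id and an offset; returns the snprintf() result. */
int format_extent(char *buf, size_t len, uint64_t id, off_t off)
{
        /* Casting to (unsigned) long long makes each argument match its
         * %llx / %lld conversion whatever the platform's type widths. */
        return snprintf(buf, len, "id=%#llx off=%lld",
                        (unsigned long long)id, (long long)off);
}
```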
<br />
* For Autoconf macros, follow the [http://www.gnu.org/software/autoconf/manual/html_node/Coding-Style.html style suggested in the autoconf manual].<br />
<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment], [ac_cv_emxos2],<br />
[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [return __EMX__;])],<br />
                   [ac_cv_emxos2=yes],<br />
                   [ac_cv_emxos2=no])])<br />
</pre><br />
:or even<br />
<pre><br />
AC_CACHE_CHECK([for EMX OS/2 environment],<br />
               [ac_cv_emxos2],<br />
               [AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],<br />
                                                   [return __EMX__;])],<br />
                                  [ac_cv_emxos2=yes],<br />
                                  [ac_cv_emxos2=no])])<br />
</pre><br />
<br />
----</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Accessing_Lustre_Code&diff=9971Accessing Lustre Code2010-01-15T18:41:43Z<p>Adilger: remove mention of initial seeding torrent</p>
<hr />
<div>'''''NOTICE:''''' The transition from CVS to Git took place on Monday, December 14. For more information about the transition, see the [[Git Transition Notice]]. For details about how to migrate to Git, see [[Migrating to Git]].<br />
<br />
----<br />
<br />
We welcome and encourage contributions to the development and testing of a more robust, feature-rich Lustre™. You can obtain the latest bleeding-edge Lustre source code by anonymous Git access.<br />
<br />
git clone git://git.lustre.org/prime/lustre.git <br />
<br />
'''''Note:''''' For more information about using Git, including tutorials and guides to help you get started, see the [http://git-scm.com/documentation Git documentation] page. ''For descriptions of the commands you are most likely to need, see the Commands section at the bottom of this page.''<br />
<br />
See [[Contribute]] for more information about developing, testing, and submitting a patch to the Lustre code.<br />
<br />
'''''Note:''''' If you have questions or experience problems, send email to the [mailto:lustre-wiki-feedback@sun.com Admins].<br />
<br />
For more information about Git, see the [http://git-scm.com/ Git home] page.<br />
<br />
=== Naming conventions ===<br />
<br />
Stable development branches are named b''{major}''_''{minor}'' (for example, b1_6 and b1_8). Even-numbered minor releases are considered stable releases. Odd-numbered minor releases correspond to alpha and beta releases and will sometimes be given v''{major}''_''{minor}''_''{patch}'' tags to provide a point of reference for internal and external testing. <br />
<br />
A release branch is created for an official release to isolate it from further development and is named b_release_''{major}''_''{minor}''_''{patch}'' (for example, b_release_1_8_0). A final release gets a tag in the form v''{major}''_''{minor}''_''{patch}'' (for example, v1_8_0 or v1_6_7_1).<br />
<br />
Work for the next upcoming version is done on the ''master'' branch.<br />
<br />
Lustre [[Subsystem Map]] describes each of the subsystems in the Lustre code.</div>Adilgerhttp://wiki.old.lustre.org/index.php?title=Applying_Lustre_Patches_to_a_Kernel&diff=9645Applying Lustre Patches to a Kernel2010-01-12T06:29:06Z<p>Adilger: /* Overview of Lustre Patches */</p>
<hr />
<div>'''''NOTICE:''''' The transition from CVS to Git took place on Monday, December 14. For more information about the transition, see the [[Git Transition Notice]]. For details about how to migrate to Git, see [[Migrating to Git]].<br />
<br />
----<br />
__TOC__<br />
This page describes how to apply Lustre™ kernel patches to a tree, how to use ''Quilt'' (a package provided with most Linux distributions) to manage changes to patches, and how to modify an existing kernel patch or contribute a new kernel patch. <br />
<br />
== Overview of Lustre Patches ==<br />
To support Lustre development and functionality, some changes must be made to the core Linux kernel. These changes are organized in a set of kernel patches kept in the Lustre repository in the directory ''lustre/kernel_patches/patches/''.<br />
<br />
For a given Linux distribution, such as RHEL5 or SLES10, the<br />
corresponding kernel target file specifies details about the kernel for<br />
which Lustre is being built. These kernel target definitions are<br />
updated by the Lustre Engineering team whenever the supported kernel<br />
version changes.<br />
<br />
For example, the file ''lustre/kernel_patches/targets/2.6-rhel5.target.in'' contains<br />
the following information:<br />
* Currently-supported kernel version (e.g., 2.6.18-128.7.1.el5)<br />
* Supported build architectures (e.g., i686, x86_64, ia64)<br />
* Name of the correct kernel patch series (e.g., 2.6-rhel5)<br />
* Version of OFED that will be used to build InfiniBand drivers<br />
<br />
The ''vanilla'' target is special in that it does not correspond to any<br />
specific Linux distribution. The target describes the latest unmodified<br />
kernel.org kernel which has been tested to work with this version of<br />
Lustre.<br />
<br />
The patches to be applied depend on the kernel that is to be used. A series file is created in ''lustre/kernel_patches/series/'' for each supported kernel to define and control the patches to be used for that kernel.<br />
<br />
For example, the file ''lustre/kernel_patches/series/2.6-rhel5.series'' lists all the patches that must be applied to a Red Hat 2.6.18 kernel to build a Lustre-compatible kernel. An excerpt from the current ''2.6-rhel5.series'' is shown below:<br />
<pre><br />
lustre_version.patch<br />
vfs_races-2.6-rhel5.patch<br />
i_filter_data.patch<br />
jbd-jcberr-2.6.18-vanilla.patch<br />
export_symbols-2.6.18-vanilla.patch<br />
...<br />
quota-large-limits-rhel5.patch<br />
</pre><br />
<br />
'''''Note:''''' For more information about the set of patches developed to address issues with RAID-5 that are included in the 2.6-rhel5.series file, see [[RAID5 Patches]].<br />
<br />
== Introduction to the Quilt package ==<br />
<br />
The ''Quilt'' package can be used to manage many patches on a single source tree. You will need ''Quilt'' to apply and manage Lustre kernel patches. A general overview of how this works is as follows:<br />
<br />
* A series file lists an ordered collection of patches.<br />
* The patches in the series form a stack.<br />
* Quilt can be used to push and pop the patches.<br />
* When the stack is managed with Quilt, patches can be edited and refreshed (updated).<br />
* Inadvertent changes can be reverted and patches forked or cloned. Diffs allow before and after change comparisons.<br />
<br />
''Quilt'' is included in most Linux distributions and can be installed using a package management utility such as yum or apt-get. It can also be downloaded from the [http://savannah.nongnu.org/projects/quilt Quilt Project Site].<br />
<br />
== Applying Lustre Kernel Patches to a Tree ==<br />
<br />
After you have checked out Lustre source code (see [[Accessing Lustre Code]]) and run the ''autogen'' script (see [[Building Lustre Code|Building Lustre Code]]), follow these steps to apply the appropriate Lustre kernel patches to your tree.<br />
<br />
==== Preparing to apply patches ====<br />
<br />
1. ''Select a series file.'' First, choose the correct kernel target (see [[#Overview of Lustre Patches|Overview of Lustre Patches]]) for your distribution and then determine the corresponding series file. <br />
<br />
2. ''Unpack a kernel source tree'' (supported kernel sources can be found at http://downloads.lustre.org/public/kernels/). For example, enter:<br />
<pre><br />
tar -xf linux-2.6.18-128.1.1-el5.tar.bz2<br />
</pre><br />
The resulting source tree, referred to as the "destination tree" may be located in, for example, ''/tmp/kernels/linux-2.6.18-128.1.1''.<br />
<br />
3. ''Choose a'' .config file ''from the directory'' lustre/kernel_patches/kernel_configs''.'' Each ''.config'' file corresponds to a supported kernel and contains the supported kernel build configuration for that kernel. <br />
<br />
4. ''Copy the selected kernel config to the root of the kernel source tree'', ensuring that the final file name is ''.config'':<br />
<pre><br />
cp lustre/kernel_patches/kernel_configs/kernel-2.6.18-2.6-rhel5-x86_64-smp.config /tmp/kernels/linux-2.6.18-128.1.1/.config<br />
</pre><br />
<br />
==== Applying the patches ====<br />
<br />
You will need ''Quilt'' to set up the series to use for your kernel. Complete the steps below:<br />
<br />
1. ''Add two symbolic links (symlinks) to your Linux source tree.'' In this example, you will add:<br />
<br />
:* symlink series -> ../lustre/kernel_patches/series/2.6-rhel5.series<br />
:* symlink patches -> ../lustre/kernel_patches/patches<br />
<br />
To add the symlinks, enter:<br />
<pre><br />
# cd /usr/src/linux-2.6.18-128.1.1<br />
# ln -s ../lustre/kernel_patches/series/2.6-rhel5.series series<br />
# ln -s ../lustre/kernel_patches/patches patches<br />
</pre><br />
<br />
2. ''Apply the patches to the kernel source tree.'' <br />
<br />
<pre><br />
# cd /usr/src/linux-2.6.18-128.1.1<br />
# quilt push -av<br />
</pre><br />
The patched Linux source tree is now suitable for use during the Lustre server build process.<br />
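The symlink step above could be wrapped in a small helper like the one below. This is a sketch under stated assumptions: the function name ''link_lustre_patches'' is hypothetical, and it assumes the Lustre checkout sits next to the kernel tree as ''../lustre'', as in the example above. It refuses to create the links if the chosen series file does not exist.

```shell
# link_lustre_patches KERNEL_TREE SERIES_NAME
# Create the 'series' and 'patches' symlinks in the top of KERNEL_TREE.
# Assumes the Lustre tree is reachable from KERNEL_TREE as ../lustre.
link_lustre_patches() {
    kernel_tree=$1
    series_name=$2
    (
        # Work inside a subshell so the caller's directory is unchanged
        cd "$kernel_tree" || exit 1
        if [ ! -f "../lustre/kernel_patches/series/$series_name" ]; then
            echo "no such series file: $series_name" >&2
            exit 1
        fi
        ln -s "../lustre/kernel_patches/series/$series_name" series &&
        ln -s ../lustre/kernel_patches/patches patches
    )
}
```

Called as ''link_lustre_patches /usr/src/linux-2.6.18-128.1.1 2.6-rhel5.series'', it leaves the tree ready for ''quilt push -av''.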
<br />
==== Building and installing a patched kernel ====<br />
<br />
After successfully applying the Lustre kernel patches, you will need to build a new kernel in order to proceed with the Lustre build process. The new kernel must be installed and running before any Lustre server components can be used.<br />
<br />
'''''Note:''''' The kernel patch, build, and install process does not need to be repeated unless the Lustre kernel patch set changes.<br />
<br />
1. ''To build the new kernel, enter'':<br />
<br />
<pre><br />
# cd /usr/src/linux-2.6.18-128.1.1<br />
# make oldconfig<br />
# make bzImage<br />
# make modules<br />
</pre><br />
<br />
Completing these steps will result in a new Linux kernel (''vmlinuz'') and its associated modules.<br />
<br />
'''''Note:''''' Installation of the new kernel is beyond the scope of this document; please consult your distribution vendor's documentation for details.<br />
<br />
== Lustre Kernel Patch Development ==<br />
<br />
If you are going to be modifying existing kernel patches or contributing new kernel patches, follow the procedures in this section. <br />
<br />
'''''Note:''''' If you plan to submit your modifications to Lustre Engineering for possible inclusion in future product releases, please be sure to follow the procedures described below.<br />
<br />
'''''Note:''''' As a general guideline, limit the scope of changes in a patch file to a group of related changes.<br />
<br />
=== Directory Layout ===<br />
<br />
Patches are stored in the Lustre directory tree as follows:<br />
<br />
* ''patches/'' - Contains all the patch files. Each patch should correspond to a single functional change.<br />
* ''series/'' - Contains the text files that ''patch-utils'' uses to define the order in which patches are applied to a tree. A series file exists for each distinctive variant of a kernel tree (corresponding to the source, such as kernel.org or Red Hat).<br />
<br />
=== Naming completed patches ===<br />
When naming patches, follow these guidelines:<br />
* Use the format <patchname>-<kernel-version>''.patch'' (for example, ''vfs_intent-2.4.20-rh.patch''). Patches are stored in ''lustre/kernel_patches/patches''.<br />
<br />
When updating patches, follow these guidelines:<br />
* Keep new versions of the patches as similar to the old ones as possible, so that ''git diff'' will clearly show the changes that have been made to the patch. <br />
* Keep the different versions of each patch as close as possible so that ''diff -u foo-rhel4.patch foo-sles10.patch'' will show as small a diff as possible and we can verify that each of the patches contains the fixes that have been applied to the others.<br />
<br />
To make this easier, options can be passed to ''quilt'' via ''$HOME/.quiltrc'' (copied from ''build/quiltrc''):<br />
<br />
<pre><br />
export QUILT_DIFF_OPTS="-upa"<br />
export QUILT_NO_DIFF_TIMESTAMPS=1<br />
</pre><br />
<br />
=== Maintaining series files ===<br />
The following conventions apply to series files:<br />
<br />
* A series file lists patches that are part of the series.<br />
* By convention, a series file supports ''one'' kernel, not two or more kernels.<br />
* When possible, patches should be applicable to multiple kernels to minimize the total number of series-specific patches.<br />
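Because several series files can share patches, it can help to confirm that every patch a series names is actually present in the patches directory. The function below is a minimal sketch of such a check. The name ''check_series'' is hypothetical, and the function assumes one patch name per line with optional ''#'' comments.

```shell
# check_series SERIES_FILE PATCH_DIR
# Report every patch listed in SERIES_FILE that is missing from PATCH_DIR.
# Returns 0 if all listed patches exist, 1 otherwise.
check_series() {
    series_file=$1
    patch_dir=$2
    status=0
    # Strip '#' comments, skip blank lines, keep the first field (patch name)
    for p in $(awk '{ sub(/#.*/, "") } NF { print $1 }' "$series_file"); do
        if [ ! -f "$patch_dir/$p" ]; then
            echo "missing: $p"
            status=1
        fi
    done
    return $status
}
```

A typical call would be ''check_series lustre/kernel_patches/series/2.6-rhel5.series lustre/kernel_patches/patches''. The same check is useful after forking or importing a patch, when the series file is expected to change.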
<br />
=== Fixing a bug involving a kernel patch ===<br />
<br />
The following example illustrates how to fix a bug that involves a kernel patch. In this example, the solution to the bug requires a change to a patch that affects ''fs/ext3/iopen.c''. <br />
<br />
Complete these steps: <br />
<br />
1. Check the series file to find the name of the patch, in this case, ''iopen-2.6-rhel5''.<br />
<br />
2. Pop to the patch by entering:<br />
<pre><br />
quilt pop iopen-2.6-rhel5<br />
</pre><br />
<br />
3. Make the fix.<br />
<br />
4. Update the patch in the Lustre source by entering (where -o starts a new session):<br />
<pre><br />
quilt refresh -o<br />
</pre><br />
<br />
5. Show the changes to fix the bug that were made in this session by entering:<br />
<pre><br />
quilt gendiff > developers-fix.diff<br />
</pre><br />
<br />
6. Put the other patches back by entering:<br />
<pre><br />
quilt push -a<br />
</pre><br />
<br />
7. Test the code. See [[Testing Lustre Code]] for details. <br />
<br />
8. When you have built and tested your changes by running the ''acceptance-small'' test suite, submit your patch for review and possible inclusion in an upcoming Lustre release. See [[Submitting Patches]] for details.<br />
<br />
The Lustre release engineering team will then apply ''developers-fix.diff'' to the ''iopen.c'' patches in each of the series with that patch. The QA team will then test each of the affected series.<br />
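To see which series a fix will touch, it is possible to search the series directory for a patch name. The helper below is a sketch only. The name ''series_with_patch'' is hypothetical, and it assumes that series files end in ''.series'' and list one patch name per line.

```shell
# series_with_patch PATCH_NAME SERIES_DIR
# Print the name of each *.series file in SERIES_DIR that lists PATCH_NAME.
series_with_patch() {
    patch_name=$1
    series_dir=$2
    for f in "$series_dir"/*.series; do
        [ -f "$f" ] || continue
        # Ignore '#' comments and match only the first field of each line
        if awk -v p="$patch_name" \
            '{ sub(/#.*/, "") } NF && $1 == p { found = 1 } END { exit !found }' "$f"
        then
            basename "$f"
        fi
    done
}
```

Running it against ''lustre/kernel_patches/series'' with the patch name from step 1 gives the set of series that need retesting.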
<br />
=== Upgrading a kernel ===<br />
<br />
To upgrade a kernel, follow this procedure. In the example below, the new kernel is 2.6.25 and the patches to be applied are for 2.6.24.<br />
<br />
1. Start pushing patches until the patch that fails is topmost (for example, ''foo-2.6.20.patch'').<br />
<br />
2. If it is used in another series (for example, the ''sles'' series), fork it by entering:<br />
<pre><br />
quilt fork foo-2.6.25-rhel.patch<br />
</pre><br />
<br />
3. Force it in by entering:<br />
<pre><br />
quilt push -f<br />
</pre><br />
<br />
4. Fix conflicts.<br />
<br />
5. Update the patch in the source by entering:<br />
<pre><br />
quilt refresh<br />
</pre><br />
<br />
6. Verify that the series file has been updated with the new patch name.<br />
<br />
'''''Note:''''' To back out a forced patch, enter:<br />
<pre><br />
quilt pop -R -f<br />
</pre><br />
<br />
=== Adding a new file to the kernel ===<br />
To add a new file to the kernel:<br />
<br />
1. To make sure the top patch is the patch you want, use ''quilt push {patch_name}'' or ''quilt pop {patch_name}''.<br />
<br />
2. To add the file to one of your patches, call ''quilt add [-P {patch_name}] {file_name}''. (If the top patch is the patch to which you want to add your new file, you can omit the ''-P {patch_name}'' option.) Quilt expects a file to be added to a patch before it is modified, so do this before creating the file.<br />
<br />
3. Create and edit the new file.<br />
<br />
=== Making changes in another source file ===<br />
To make changes to another source file:<br />
<br />
1. Make sure the top patch is the patch you want using ''quilt push {patch_name}'' or ''quilt pop {patch_name}''.<br />
<br />
2. To add that source file to the top patch, call ''quilt add {that_source_name}''.<br />
<br />
3. Make changes in the source file.<br />
<br />
4. To refresh the patch, call ''quilt refresh''.<br />
<br />
=== Adding a patch into a series ===<br />
<br />
To add a patch into a series:<br />
<br />
1. Ideally, add the patch to the end of the series using ''quilt import'' to avoid the risk of cascading patch modifications.<br />
<br />
2. After the patch is imported, apply and refresh it using ''quilt push'' and ''quilt refresh''.<br />
<br />
3. Verify that the series file was updated.<br />
<br />
4. Add the patch to the repository with ''git add''.<br />
<br />
'''''Note:''''' If you are introducing a new patch, use ''quilt new'', then edit the patch, and then use ''quilt refresh''.</div>Adilger