Architecture - ZFS TinyZAP
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
Definitions
- ZAP: ZFS Attribute Processor, a hashed name=value lookup table that can be used for efficient and scalable attribute storage
- MicroZAP: a form of ZAP used by ZFS that only allows the value to be a single __u64
- TinyZAP: a compact ZAP format that allows arbitrary values to be stored
- FatZAP: a form of ZAP used by ZFS that allows arbitrary name/value pairs but (as the name implies) consumes a lot of space
Use Cases
id | quality attribute | summary |
---|---|---|
fast_ea | performance | TinyZAP must be flexible enough to allow storage in arbitrary-sized containers, such as a large dnode or an ancillary dbuf, without wasting much space per attribute |
mdt_fid | performance | ZAP must allow FID storage for MDT directories in a manner that is compatible with ZFS directories |
compatible-zfs | usability | integration should be done at the DMU level so that any ZAP users (including ZFS) can use TinyZAPs to store name/value data |
Efficient EA storage
Scenario: | Efficient EA storage in large dnode | |
Business Goals: | Fast access to Lustre EA values | |
Relevant QA's: | Performance | |
details | Stimulus: | EA needs to be stored in large dnode |
Stimulus source: | Lustre OSD (MDT/OST) storing EA data to object | |
Environment: | EA being stored on a specific dnode within a transaction | |
Artifact: | EA is stored in the dnode in ZAP | |
Response: | size of EA in ZAP | |
Response measure: | overhead should not be more than approximately 128 bytes + 48 bytes/EA | |
Questions: | ||
Issues: | None. |
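The response measure above can be read as a simple linear budget. The following is a minimal sketch of that arithmetic, assuming "overhead" means bytes consumed beyond the raw name and value payload; the helper names are hypothetical and only the 128 and 48 byte figures come from the table.

```c
#include <stddef.h>

/* Figures taken from the response measure: ~128 bytes fixed + ~48 bytes/EA. */
enum {
    TZAP_FIXED_OVERHEAD  = 128, /* approximate fixed cost per TinyZAP */
    TZAP_PER_EA_OVERHEAD = 48   /* approximate cost per stored EA */
};

/* Hypothetical helper: overhead budget in bytes for nr_eas attributes. */
size_t tzap_overhead_budget(size_t nr_eas)
{
    return TZAP_FIXED_OVERHEAD + TZAP_PER_EA_OVERHEAD * nr_eas;
}

/*
 * Hypothetical helper: returns 1 if a TinyZAP that occupies used_bytes,
 * of which payload_bytes are names and values, stays within the budget.
 * Treating payload separately from overhead is an assumption.
 */
int tzap_within_budget(size_t used_bytes, size_t payload_bytes, size_t nr_eas)
{
    return used_bytes <= payload_bytes + tzap_overhead_budget(nr_eas);
}
```

For example, under this reading a dnode holding four EAs would be allowed roughly 128 + 4 * 48 = 320 bytes of ZAP overhead on top of the EA data itself.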
MDT FIDs in directories
Scenario: | Storage of MDT FIDs in directories | |
Business Goals: | Able to store FID data along with name and (in some cases) local DMU objid in a single directory entry to avoid extra FID lookup overhead | |
Relevant QA's: | Performance | |
details | Stimulus: | Storing MDT FID in a directory entry |
Stimulus source: | Lustre MDT creating new directory entry | |
Environment: | Directory entry being created within a transaction with objid and FID | |
Artifact: | FID is stored in directory entry after __u64 objid value | |
Response: | size of EA in ZAP | |
Response measure: | overhead should not be more than approximately 128 bytes + 48 bytes/EA | |
Questions: | ||
Issues: | None. |
ZFS compatible directories
Scenario: | ZFS reading TinyZAP directory with FID | |
Business Goals: | Compatible with (possibly modified) ZFS code | |
Relevant QA's: | Usability | |
details | Stimulus: | ZFS reading TinyZAP entry |
Stimulus source: | ZFS doing name lookup in directory | |
Environment: | Lustre MDT mounted with ZFS for diagnostic reasons | |
Artifact: | ZFS is able to read directory and find objid (if local) | |
Response: | ZFS does not fail in object lookup | |
Response measure: | ||
Questions: | What should be stored in the objid in a CMD environment when the object is on a remote MDT? | |
Issues: | There will be some small amount of code change needed in ZFS to only access the first __u64 of the value, because it currently only provides enough space to return a single __u64 of data. We will also need to handle the case for a remote MDT by either storing objid=0 or some other well-defined value. |
Implementation constraints
TinyZAP needs to be flexible enough to store arbitrary name/value data, including both Lustre LOV EAs and MDT directories with extended FID data. Using a MicroZAP is not possible because it only allows storage of a single __u64 value with each entry. Using a FatZAP is wasteful because it requires a full block just for the header and a separate block for the leaf data.
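The MicroZAP limitation can be illustrated with a rough sketch of its entry layout. The struct below is a local stand-in modelled on ZFS's mzap_ent_phys_t (a 64-byte entry with one 64-bit value slot); the `_SKETCH` names, the 16-byte FID length, and the helper functions are assumptions for illustration and should be checked against the ZFS and Lustre sources.

```c
#include <stddef.h>
#include <stdint.h>

#define MZAP_ENT_LEN_SKETCH  64
#define MZAP_NAME_LEN_SKETCH (MZAP_ENT_LEN_SKETCH - 8 - 4 - 2)

/* Local stand-in for a MicroZAP entry; not the real ZFS definition. */
struct mzap_ent_sketch {
    uint64_t mze_value;                      /* the only value slot */
    uint32_t mze_cd;                         /* collision differentiator */
    uint16_t mze_pad;
    char     mze_name[MZAP_NAME_LEN_SKETCH]; /* entry name */
};

/* Assumed size of a Lustre FID: 64-bit sequence, 32-bit oid, 32-bit version. */
enum { LU_FID_LEN_SKETCH = 16 };

/* Bytes an MDT directory entry value needs: local objid followed by the FID. */
size_t mdt_dirent_value_len(void)
{
    return sizeof(uint64_t) + LU_FID_LEN_SKETCH;
}

/* Returns 1 if a value of value_len bytes fits the MicroZAP value slot. */
int fits_in_microzap(size_t value_len)
{
    return value_len <= sizeof(((struct mzap_ent_sketch *)0)->mze_value);
}
```

Under these assumptions an objid plus FID needs 24 bytes, three times what the single MicroZAP value slot can hold, which is why a more flexible format is needed.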
A preferred implementation would give the TinyZAP a structure similar to the existing zap_leaf_{phys_t,chunk}, since the leaf structure is reasonably compact and reusing it may avoid duplicating a large amount of nearly identical code in the ZAP.
ZFS should be adapted (if necessary) to handle directories created with the TinyZAP layout, so that it can get the objid from the first __u64 and ignore the FID component of the directory entry.
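A minimal sketch of that read path follows, assuming the entry value is already in host byte order and is laid out as a __u64 objid followed by opaque extra data. The function name and the TZAP_OBJID_REMOTE constant are hypothetical; the latter only illustrates the objid=0 option mentioned in the Issues above and is not a settled convention.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical marker for "object lives on a remote MDT" (objid=0 option). */
#define TZAP_OBJID_REMOTE 0ULL

/*
 * Extract the local objid from a directory entry value, ignoring any
 * trailing FID data. Returns 0 on success, -1 if the value is too short
 * or an argument is NULL.
 */
int tzap_dirent_objid(const void *value, size_t value_len, uint64_t *objid)
{
    if (value == NULL || objid == NULL || value_len < sizeof(uint64_t))
        return -1;
    /* memcpy avoids unaligned access when the value sits in a packed buffer */
    memcpy(objid, value, sizeof(*objid));
    return 0;
}

/* Returns 1 if the objid marks a remote object under the assumed convention. */
int tzap_objid_is_remote(uint64_t objid)
{
    return objid == TZAP_OBJID_REMOTE;
}
```

A ZFS-side caller would only need the first function; anything after the first 8 bytes is skipped without being interpreted.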
The current ZAP implementation takes an object set and object number as parameters, whereas the TinyZAP will need to operate on a buffer that might be located in the dnode or in an external block, so some refactoring of the ZAP code may be required.
The TinyZAP code needs to handle endian swabbing correctly, as all ZFS code does.
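As a rough illustration of what swabbing involves, the sketch below shows a 64-bit byte swap, an in-place swap over an array of 64-bit words, and a magic-based check for whether a buffer was written with the opposite byte order. All function names are hypothetical, and the detection scheme (comparing against the byte-swapped magic) is an assumption about how a TinyZAP header might be checked, not something this design specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a 64-bit word. */
uint64_t tzap_bswap64(uint64_t x)
{
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    return (x << 32) | (x >> 32);
}

/* Byte-swap count 64-bit words in place. */
void tzap_swab_u64_array(uint64_t *buf, size_t count)
{
    for (size_t i = 0; i < count; i++)
        buf[i] = tzap_bswap64(buf[i]);
}

/*
 * Compare an on-disk magic with the expected one.
 * Returns 0 if it matches as-is, 1 if it matches only after swapping
 * (the buffer needs swabbing), and -1 if it matches neither way.
 */
int tzap_needs_swab(uint64_t on_disk_magic, uint64_t expected_magic)
{
    if (on_disk_magic == expected_magic)
        return 0;
    if (on_disk_magic == tzap_bswap64(expected_magic))
        return 1;
    return -1;
}
```

Fields narrower than 64 bits, such as the 32-bit members of a FID, would need their own width-specific swaps rather than being swapped as part of a 64-bit word.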
Questions and Issues
Should we "wrap" the FID data after the DMU object id in an MDT directory so that it is possible in the future to add other extra data in a directory without compromising compatibility? Something like:

    #define ZAP_LUSTRE_FID 0x110f1d0f1d0f1d10

    struct zap_dir_fid {
            __u64         zdf_magic;
            struct lu_fid zdf_fid;   /* or other data as appropriate */
    };

    #define zdf_len (zdf_magic & 0xff)

This means we can skip (possibly unknown) extra directory info by skipping (zdf_len) bytes at a time looking for zdf_magic == ZAP_LUSTRE_FID.
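One possible reading of this scheme is sketched below. It assumes the entry value is a __u64 objid followed by a sequence of records, each a 64-bit magic whose low byte gives the length of the payload that follows it (0x10 = 16 bytes for ZAP_LUSTRE_FID, which matches the assumed size of a FID). Whether zdf_len counts the magic itself is not stated above, so that interpretation is an assumption, as are the function name and the lu_fid_sketch stand-in type. Data is assumed to be in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZAP_LUSTRE_FID 0x110f1d0f1d0f1d10ULL

/* Local stand-in for Lustre's struct lu_fid; not the real definition. */
struct lu_fid_sketch {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/*
 * Walk the records after the leading objid and copy out the first FID.
 * Unknown records are skipped using the length in the low byte of their
 * magic. Returns 0 if a FID was found, -1 if none was found or the
 * value is truncated.
 */
int zap_dir_find_fid(const void *value, size_t value_len,
                     struct lu_fid_sketch *fid)
{
    const unsigned char *p = value;
    size_t off = sizeof(uint64_t);          /* skip the leading objid */

    while (off + sizeof(uint64_t) <= value_len) {
        uint64_t magic;
        size_t plen;

        memcpy(&magic, p + off, sizeof(magic));
        plen = (size_t)(magic & 0xff);      /* zdf_len: payload bytes */
        off += sizeof(magic);
        if (plen > value_len - off)
            return -1;                      /* record runs past the value */
        if (magic == ZAP_LUSTRE_FID) {
            if (plen != sizeof(*fid))
                return -1;                  /* stand-in type has wrong size */
            memcpy(fid, p + off, sizeof(*fid));
            return 0;
        }
        off += plen;                        /* skip an unknown record */
    }
    return -1;
}
```

With this layout, a reader that does not recognise a record's magic can still step over it, which is the forward-compatibility property the question is asking about.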
References
http://www.opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf