Architecture - ZFS TinyZAP

Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.

Definitions

ZAP: ZFS Attribute Processor, a hashed name=value lookup table that can be used to do efficient and scalable attribute storage
MicroZAP: a form of ZAP used by ZFS that only allows each value to be a single __u64
TinyZAP: a compact ZAP format that allows arbitrary values to be stored
FatZAP: a form of ZAP used by ZFS that allows arbitrary name/value pairs but (as the name implies) consumes a lot of space

Use Cases

id	quality attribute	summary
fast_ea	performance	TinyZAP must be flexible enough to allow storage in arbitrary-sized containers, such as a large dnode or an ancillary dbuf, without wasting too much space per attribute
mdt_fid	performance	ZAP must allow FID storage for MDT directories in a manner that is compatible with ZFS directories
compatible-zfs	usability	integration should be done at the DMU level so that any ZAP users (including ZFS) can use TinyZAPs to store name/value data

Efficient EA storage

Scenario:		Efficient EA storage in large dnode
Business Goals:		Fast access to Lustre EA values
Relevant QA's:		Performance
details	Stimulus:	EA needs to be stored in large dnode
	Stimulus source:	Lustre OSD (MDT/OST) storing EA data to object
	Environment:	EA being stored on a specific dnode within a transaction
	Artifact:	EA is stored in the dnode in ZAP
	Response:	size of EA in ZAP
	Response measure:	overhead should not be more than approximately 128 bytes + 48 bytes/EA
Questions:
Issues:		None.
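
As a rough illustration of the response measure above, the following sketch computes the overhead budget for a given number of EAs and checks whether a set of EAs would fit in a container of a given size. The function and macro names are hypothetical, and the container sizes a caller passes in are whatever the large dnode or ancillary dbuf actually provides.

```c
#include <stddef.h>

/* Figures taken from the response measure; the names are illustrative only. */
#define TZAP_FIXED_OVERHEAD   128u  /* approximate fixed cost per TinyZAP */
#define TZAP_PER_EA_OVERHEAD   48u  /* approximate bookkeeping cost per EA */

/* Upper bound on bookkeeping bytes for a TinyZAP holding num_eas EAs. */
static size_t tzap_overhead_budget(size_t num_eas)
{
    return TZAP_FIXED_OVERHEAD + TZAP_PER_EA_OVERHEAD * num_eas;
}

/*
 * Returns 1 if num_eas EAs carrying payload_bytes of name and value data
 * fit in a container of container_bytes while staying within the overhead
 * budget, and 0 otherwise.
 */
static int tzap_fits(size_t container_bytes, size_t num_eas,
                     size_t payload_bytes)
{
    return tzap_overhead_budget(num_eas) + payload_bytes <= container_bytes;
}
```

For example, three EAs have a budget of 128 + 3 * 48 = 272 bytes of overhead, so they would fit in a 512-byte container only if their names and values total 240 bytes or less.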

MDT FIDs in directories

Scenario:		Storage of MDT FIDs in directories
Business Goals:		Able to store FID data along with name and (in some cases) local DMU objid in a single directory entry to avoid extra FID lookup overhead
Relevant QA's:		Performance
details	Stimulus:	Storing MDT FID in a directory entry
	Stimulus source:	Lustre MDT creating new directory entry
	Environment:	Directory entry being created within a transaction with objid and FID
	Artifact:	FID is stored in directory entry after __u64 objid value
	Response:	size of EA in ZAP
	Response measure:	overhead should not be more than approximately 128 bytes + 48 bytes/EA
Questions:
Issues:		None.
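
The artifact described above (FID stored after the __u64 objid) could be laid out roughly as follows. This is a sketch only: struct tzap_dirent_val and the helper functions are hypothetical names, and struct lu_fid is redeclared locally as a stand-in for the Lustre definition so that the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for Lustre's struct lu_fid (sequence, object id, version). */
struct lu_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/* Hypothetical directory-entry value: local DMU objid first, FID after it. */
struct tzap_dirent_val {
    uint64_t      tdv_objid;
    struct lu_fid tdv_fid;
};

/* The value is stored on disk, so its layout must not contain padding. */
static_assert(sizeof(struct lu_fid) == 16, "FID must pack to 16 bytes");
static_assert(sizeof(struct tzap_dirent_val) == 24, "objid + FID, no padding");

/* Pack objid and FID into the value buffer stored alongside the entry name. */
static void tzap_dirent_pack(void *buf, uint64_t objid,
                             const struct lu_fid *fid)
{
    struct tzap_dirent_val v = { .tdv_objid = objid, .tdv_fid = *fid };

    memcpy(buf, &v, sizeof(v));
}

/* A reader that only understands a bare objid looks at the first 8 bytes. */
static uint64_t tzap_dirent_objid(const void *buf)
{
    uint64_t objid;

    memcpy(&objid, buf, sizeof(objid));
    return objid;
}
```

Keeping the objid at offset 0 is what lets a single lookup return both the local object and its FID while remaining readable by code that expects only a __u64.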

ZFS compatible directories

Scenario:		ZFS reading TinyZAP directory with FID
Business Goals:		Compatible with (possibly modified) ZFS code
Relevant QA's:		Usability
details	Stimulus:	ZFS reading TinyZAP entry
	Stimulus source:	ZFS doing name lookup in directory
	Environment:	Lustre MDT mounted with ZFS for diagnostic reasons
	Artifact:	ZFS is able to read directory and find objid (if local)
	Response:	ZFS does not fail in object lookup
	Response measure:
Questions:		What should be stored in the objid in a CMD environment where the object is on a remote MDT?
Issues:		There will be some small amount of code change needed in ZFS to only access the first __u64 of the value, because it currently only provides enough space to return a single __u64 of data. We will also need to handle the case for a remote MDT by either storing objid=0 or some other well-defined value.
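
A minimal sketch of the reader-side change described in the issue above, assuming the remote-MDT case is encoded as objid=0 (one of the options mentioned, not a settled decision). The function name and return codes are hypothetical and do not correspond to existing ZFS interfaces.

```c
#include <stdint.h>
#include <string.h>

/* Assumed sentinel: the object lives on a remote MDT, no local objid. */
#define TZAP_OBJID_REMOTE 0

enum tzap_status {
    TZAP_OK = 0,     /* *objid holds a local DMU object id */
    TZAP_TOO_SHORT,  /* value is smaller than one __u64 */
    TZAP_REMOTE      /* entry refers to an object on another MDT */
};

/*
 * Extract the local objid from a directory-entry value that may be longer
 * than a single __u64. Only the first 8 bytes are examined; any FID or
 * other trailing data is ignored.
 */
static enum tzap_status tzap_value_to_objid(const void *val, size_t val_len,
                                            uint64_t *objid)
{
    uint64_t first;

    if (val_len < sizeof(first))
        return TZAP_TOO_SHORT;
    memcpy(&first, val, sizeof(first));
    if (first == TZAP_OBJID_REMOTE)
        return TZAP_REMOTE;
    *objid = first;
    return TZAP_OK;
}
```

With this shape, a ZFS name lookup on a diagnostic mount can report a remote entry explicitly instead of failing on an unexpected value length.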

Implementation constraints

TinyZAP needs to be flexible enough to store arbitrary name/value data, including both Lustre LOV EAs and MDT directories with extended FID data. A MicroZAP cannot be used because it only allows a single __u64 value to be stored with each entry. A FatZAP is wasteful because it requires a full block just for the header and a separate block for the leaf data.

A preferred implementation would give the TinyZAP a structure similar to the existing zap_leaf_{phys_t,chunk}, since the leaf structure is reasonably compact and reusing it may avoid duplicating a large amount of nearly identical code in the ZAP.

ZFS should be adapted (if necessary) to handle directories created with the TinyZAP layout, so that it can get the objid from the first __u64 and ignore the FID component of the directory entry.

The current ZAP implementation uses an object set and object number as parameters and we will need to interface using a buffer that might be located in the dnode or in an external block. So this might require some refactoring of the ZAP code.

This needs to handle endian swabbing issues correctly, as does all ZFS code.
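
As an illustration of the swabbing requirement, the sketch below byte-swaps an array of 64-bit words in place, as would be needed when a buffer written by a host of the opposite byte order is read. The function names are local to this example and are not the ZFS helpers. Fields narrower than 64 bits would need to be swabbed at their own width rather than as part of a __u64.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 64-bit word. */
static uint64_t swab64(uint64_t x)
{
    /* Swap 32-bit halves, then 16-bit pairs, then adjacent bytes. */
    x = ((x & 0x00000000ffffffffULL) << 32) | (x >> 32);
    x = ((x & 0x0000ffff0000ffffULL) << 16) |
        ((x >> 16) & 0x0000ffff0000ffffULL);
    x = ((x & 0x00ff00ff00ff00ffULL) << 8) |
        ((x >> 8) & 0x00ff00ff00ff00ffULL);
    return x;
}

/* Swab an array of 64-bit words in place, e.g. a run of __u64 values. */
static void swab64_array(uint64_t *words, size_t count)
{
    for (size_t i = 0; i < count; i++)
        words[i] = swab64(words[i]);
}
```

Swabbing is its own inverse, so applying swab64_array twice returns the original data. That property makes it straightforward to test.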

Questions and Issues

Should we "wrap" the FID data after the DMU object id in an MDT directory so that it is possible in the future to add other extra data in a directory without compromising compatibility? Something like:

    #define ZAP_LUSTRE_FID 0x110f1d0f1d0f1d10

    struct zap_dir_fid {
            __u64 zdf_magic;
            struct lu_fid zdf_fid;  /* or other data as appropriate */
    };

    #define zdf_len (zdf_magic & 0xff)

This means we can skip (possibly unknown) extra directory info by skipping (zdf_len) bytes at a time looking for zdf_magic == ZAP_LUSTRE_FID.
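
One possible reading of this proposal is sketched below. It assumes that the value starts with the __u64 objid, that each extra record is a __u64 magic followed by a payload, and that the low byte of the magic gives the payload length excluding the magic itself (0x10 = 16 bytes for ZAP_LUSTRE_FID, which matches the size of a FID). Whether the length should include the magic is not settled by the text above. The function zap_dir_find_fid and the ZDF_LEN macro are hypothetical, and struct lu_fid is a local stand-in for the Lustre definition.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ZAP_LUSTRE_FID 0x110f1d0f1d0f1d10ULL
/* Assumption: low byte of the magic is the payload length in bytes. */
#define ZDF_LEN(magic) ((size_t)((magic) & 0xff))

/* Local stand-in for Lustre's struct lu_fid. */
struct lu_fid {
    uint64_t f_seq;
    uint32_t f_oid;
    uint32_t f_ver;
};

/*
 * Search the extra data that follows the objid for a FID record.
 * Returns 1 and fills *fid if one is found; returns 0 if there is no
 * FID record or if a record runs past the end of the value.
 */
static int zap_dir_find_fid(const unsigned char *val, size_t val_len,
                            struct lu_fid *fid)
{
    size_t off = sizeof(uint64_t);      /* skip the leading objid */

    while (off + sizeof(uint64_t) <= val_len) {
        uint64_t magic;
        size_t plen;

        memcpy(&magic, val + off, sizeof(magic));
        off += sizeof(magic);
        plen = ZDF_LEN(magic);
        if (plen > val_len - off)
            return 0;                   /* truncated record */
        if (magic == ZAP_LUSTRE_FID) {
            memcpy(fid, val + off, sizeof(*fid));
            return 1;
        }
        off += plen;                    /* unknown record: skip payload */
    }
    return 0;
}
```

Because every record is at least 8 bytes (the magic), the loop always advances, even when an unknown record declares a zero-length payload.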

References

ZFS large dnodes

http://www.opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf

WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.
