Architecture - HSM Migration

Purpose
This page describes use cases and high-level architecture for migrating files between Lustre and an HSM system.

Definitions

 * Trigger : A process or event in the file system which causes a migration to take place (or be denied).
 * Coordinator : A service coordinating migration of data.
 * Agent : A service used by coordinators to move data or cancel such movement.
 * Mover : The userspace component of the agent which copies the file between Lustre and the HSM storage.
 * Copy tool : HSM-specific component of the mover. (May be the entire mover.)
 * I/O request : This term groups read requests, write requests and other metadata accesses like truncate or unlink.
 * Resident : A file whose working copy is in Lustre.
 * Release : A released file's data has been removed by Lustre after being copied to the HSM. The MDT retains metadata info for the file.
 * Archive : An archived file's data resides in the HSM. File data may or may not also reside in Lustre.  The MDT retains metadata info for the file.
 * Restore : Copy a file from the HSM back into Lustre to make an Archived file Resident.
 * Prestage : An explicit call (from User or Policy Engine) to Restore a Released file.

Coordinator

 * 1) dispatches requests to agents; chooses agents
 * 2) restore 
 * 3) archive 
 * 4) unlink 
 * 5) abort_action
 * 6) consolidates repeat requests
 * 7) re-queues requests to a new agent if an agent becomes unresponsive (aborts old request)
 * 8) agents send regular progress updates to coordinator (e.g. current extent)
 * 9) coordinator periodically checks for stuck threads
 * 10) coordinator requests are persistent
 * 11) all requests coming to the coordinator are kept in llog, cancelled when complete or aborted
 * 12) kernel-space service, MDT acts as initiator for copyin
 * 13) ioctl interface for all requests.  Initiators are policy engine, administrator tool, or MDT for cache-miss.
 * 14) Location: a coordinator will be directly integrated with each MDT
 * 15) Agents will communicate via MDC
 * 16) Connection/reconnection already taken care of; no additional pinging, config
 * 17) Client mount option will indicate "agent", connect flag will inform MDT
 * 18) MDT already has intimate knowledge of HSM bits (see below) and needs to communicate with coordinator anyhow
 * 19) HSM comms can use a new portal and reuse MDT threads.
 * 20) Coordinators will handle the same namespace segment as each MDT under CMD
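The consolidation and re-queue behaviour listed above can be sketched in a few lines. This is only an illustrative userspace model, not the kernel-space coordinator: the `Request` and `Coordinator` classes, the round-robin agent choice and the method names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Request:
    fid: str
    action: str                   # e.g. "restore", "archive", "unlink"
    agent: Optional[str] = None   # agent currently handling the request


@dataclass
class Coordinator:
    """Hypothetical in-memory model of the coordinator's request table."""
    agents: List[str]
    pending: Dict[Tuple[str, str], Request] = field(default_factory=dict)
    _next: int = 0

    def _choose_agent(self, exclude: Optional[str] = None) -> str:
        # Round-robin is an assumption; the page only says "chooses agents".
        candidates = [a for a in self.agents if a != exclude]
        if not candidates:
            raise RuntimeError("no agent available")
        agent = candidates[self._next % len(candidates)]
        self._next += 1
        return agent

    def submit(self, fid: str, action: str) -> Request:
        key = (fid, action)
        if key in self.pending:
            # Consolidate a repeat request for the same file and action.
            return self.pending[key]
        req = Request(fid, action, self._choose_agent())
        self.pending[key] = req
        return req

    def agent_unresponsive(self, agent: str) -> List[Request]:
        # Re-queue every request of the dead agent onto another agent.
        moved = []
        for req in self.pending.values():
            if req.agent == agent:
                req.agent = self._choose_agent(exclude=agent)
                moved.append(req)
        return moved

    def complete(self, fid: str, action: str) -> None:
        # Corresponds to cancelling the llog record once the request is done.
        self.pending.pop((fid, action), None)
```

In the real design the table would be backed by the llog so that requests survive an MDT restart; the dictionary here stands in for that persistent store.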

MDT changes

 * 1) Per-file layout lock
 * 2) A new layout lock is created for every file.  The lock contains a layout version number.
 * 3) Private writer lock is taken by the MDT when allocating/changing file layout (LOV EA).
 * 4) The lock is not released until the layout change is complete and the data exist in the new layout.
 * 5) The MDT will take group extent locks for the entire file.  The group ID will be passed to the agent performing the data transfer.
 * 6) The current layout version is stored by the OSTs for each object in the layout.
 * 7) Shared reader locks are taken by anyone reading the layout (client opens, lfs getstripe) to get the layout version.
 * 8) Anyone taking a new extent lock anywhere in the file includes the layout version. The OST will grant an extent lock only if the layout version included in the RPC matches the object layout version.
 * 9) lov EA changes
 * 10) flags
 * 11) hsm_released: file is not resident on OSTs; only in HSM
 * 12) hsm_exists: some version of this fid exists in HSM; maybe partial or outdated
 * 13) hsm_dirty: file in HSM is out of date
 * 14) hsm_archived: a full copy of this file exists in HSM; if not hsm_dirty, then the HSM copy is current.
 * 15) The hsm_released flag is always manipulated under a write layout lock, the other flags are not.
 * 16) new ioctls for HSM control:
 * 17) HSM_REQUEST: policy engine or admin requests (archive, release, restore, remove, cancel) 
 * 18) HSM_STATE_GET: user requests HSM status information on a single file
 * 19) HSM_STATE_SET: user sets HSM policy flags for a single file (HSM_NORELEASE, HSM_NOARCHIVE)
 * 20) HSM_PROGRESS: copytool reports periodic state of a single request (current extent, error)
 * 21) HSM_TAPEFILE_ADD: add an existing archived file into the Lustre filesystem (only metadata is copied).
 * 22) changelogs:
 * 23) new events for HSM event completion
 * 24) restore_complete
 * 25) archive_complete
 * 26) unlink_complete
 * 27) per-event flags used by HSM
 * 28) setattr: data_changed (actually mtime_changed for V1)
 * 29) archive_complete: hsm_dirty
 * 30) all HSM events: hsm_failed
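As a reading aid for the flag semantics and the layout-version check above, here is a minimal sketch. The bit values, the `HsmFlags` name and both helper functions are hypothetical; only the flag names and their meaning come from the list.

```python
from enum import IntFlag


class HsmFlags(IntFlag):
    # Bit values are illustrative only; the page does not define them.
    RELEASED = 0x1   # hsm_released: data is only in the HSM
    EXISTS = 0x2     # hsm_exists: some version of the fid exists in the HSM
    DIRTY = 0x4      # hsm_dirty: the HSM copy is out of date
    ARCHIVED = 0x8   # hsm_archived: a full copy exists in the HSM


def hsm_copy_is_current(flags: HsmFlags) -> bool:
    """A full copy exists in the HSM and it has not been marked dirty."""
    return bool(flags & HsmFlags.ARCHIVED) and not (flags & HsmFlags.DIRTY)


def ost_grants_extent_lock(rpc_layout_version: int,
                           object_layout_version: int) -> bool:
    """The OST grants an extent lock only if the layout versions match."""
    return rpc_layout_version == object_layout_version
```

For example, `hsm_copy_is_current(HsmFlags.EXISTS | HsmFlags.ARCHIVED)` is true, and becomes false as soon as `HsmFlags.DIRTY` is added.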

Agent
An agent manages local HSM requests on a client.
 * 1) one agent per client max; most clients will not have agents
 * 2) consists of two parts
 * 3) kernel component receives messages from the coordinator (LNET comms)
 * 4) agents and coordinator piggyback comms on MDC/MDT: connections, recovery, etc.
 * 5) coordinator uses reverse imports to send RPCs to agents
 * 6) userspace process copies data between Lustre and HSM backend
 * 7) will use special fid directory for file access (.lustre/fid/XXXX)
 * 8) interfaces with hardware-specific copytool to access HSM files
 * 9) kernel process passes requests to userspace process via socket

Copytool
The copytool copies data between Lustre and the HSM backend, and deletes the HSM object when necessary.
 * 1) userspace; runs on a Lustre client with HSM i/o access
 * 2) opens objects by fid
 * 3) may manipulate HSM mode flags in an EA.
 * 4) uses ioctl calls on the (opened-by-fid) file to report progress to MDT.  Note MDT must pass some messages on to Coordinator.
 * 5) updates progress regularly while waiting for HSM (e.g. every X seconds)
 * 6) reports error conditions
 * 7) reports current extent
 * 8) copytools are HSM-specific, since they must move data to the HSM archive
 * 9) version 1 will include tools for HPSS and SAM-QFS
 * 10) other, vendor-proprietary (binary) tools may be wrapped in order to include Lustre ioctl progress calls.
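The copy-and-report loop of a copytool might look roughly like the sketch below. It is a simplification under stated assumptions: `copy_with_progress` and the `report` callback are hypothetical names, the callback stands in for the progress ioctl, and progress is reported after every chunk rather than every X seconds.

```python
import io
from typing import BinaryIO, Callable, Optional

# report(start, end, error) stands in for the progress ioctl (hypothetical).
Report = Callable[[int, int, Optional[Exception]], None]


def copy_with_progress(src: BinaryIO, dst: BinaryIO, report: Report,
                       chunk_size: int = 1 << 20) -> int:
    """Copy src to dst, reporting the extent copied so far after each chunk."""
    offset = 0
    try:
        while True:
            buf = src.read(chunk_size)
            if not buf:
                break
            dst.write(buf)
            offset += len(buf)
            report(0, offset, None)      # current extent: 0..offset
    except OSError as exc:
        report(0, offset, exc)           # report the error condition
        raise
    return offset


# Usage with in-memory files standing in for Lustre and the HSM backend.
source = io.BytesIO(b"0123456789")
target = io.BytesIO()
updates = []
copied = copy_with_progress(
    source, target, lambda s, e, err: updates.append((s, e, err)),
    chunk_size=4)
```

A wrapper around a vendor-proprietary tool could use the same callback shape to add the progress calls without touching the data path.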

Policy Engine

 * 1) makes policy decisions for archive, release (which files and when)
 * 2) policy engine will provide the functionality of the Space_Manager and any other archive/release policies
 * 3) may be based on space available per filesystem, OST, or pool
 * 4) may be based on any filesystem or per-file attributes (last access time, file size, file type, etc)
 * 5) policy engine will therefore require access to various user-available info: changelogs, getstripe, lfs df, stat, lctl get_param, etc.
 * 6) normally uses changelogs and 'df' for input; rarely is allowed to scan filesystem
 * 7) changelogs are available to superuser on Lustre clients
 * 8) filesystem scans are expensive; allowed only at initial HSM setup time or other rare events
 * 9) the policy engine runs as a userspace process; requests archive and release via file ioctl to coordinator (through MDT).
 * 10) policy engine may be packaged separately from Lustre
 * 11) the policy engine may use HSM-backend specific features (e.g. HPSS storage class) for policy optimizations, but these will be kept modularized so they are easily removed for other systems.
 * 12) API can pass an opaque arbitrary chunk of data (char array, size) from policy engine ioctl call through coordinator and agent to copytool.
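To make the release policy concrete, here is a minimal sketch of one possible rule: release the least recently accessed files that already have a current HSM copy until enough space is freed. The `FileInfo` record, the function name and the thresholds are assumptions for illustration; a real policy engine would build this information from changelogs and 'df' output.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FileInfo:
    fid: str
    size: int        # bytes
    atime: float     # last access time, seconds since the epoch
    archived: bool   # a full copy exists in the HSM
    dirty: bool      # the HSM copy is out of date


def select_release_candidates(files: Iterable[FileInfo], bytes_to_free: int,
                              now: float, min_idle_seconds: float) -> List[str]:
    """Pick FIDs to release, oldest access time first."""
    eligible = [f for f in files
                if f.archived and not f.dirty
                and now - f.atime >= min_idle_seconds]
    eligible.sort(key=lambda f: f.atime)
    chosen: List[str] = []
    freed = 0
    for f in eligible:
        if freed >= bytes_to_free:
            break
        chosen.append(f.fid)
        freed += f.size
    return chosen
```

Files that are dirty or not yet archived are never selected, which matches the precondition that a file can only be released once its HSM copy is current.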

Configuration

 * 1) policy engine has its own external configuration
 * 2) coordinator starts as part of MDT; tracks agents registrations as clients connect
 * 3) connect flag to indicate agent should run on this MDC
 * 4) mdt_set_info RPC for setting agent status using 'remount'

Version 1 ("simple"): "Migration on open" policy
Clients block at open for read and write. OSTs are not involved.
 * 1) Client layout-intent enqueues layout read lock on the MDT.
 * 2) MDT checks hsm_released bit; if released, the MDT takes PW lock on the layout
 * 3) MDT creates a new layout with a similar stripe pattern as the original, increasing the layout version, and allocating new objects on new OSTs with the new version.
 * (We should try to respect specific layout settings (pool, stripecount, stripesize), but be flexible if e.g. pool doesn't exist anymore.
 * Maybe we want to ignore stripe offset and/or specific OST allocations in order to rebalance.)
 * 1) MDT enqueues group write lock on extents 0-EOF
 * Extents lock enqueue timeout must be very long while group lock is held (need proc tunable here)
 * 1) MDT releases PW layout lock
 * Client open succeeds at this point, but r/w is blocked on extent locks
 * 1) MDT sends request to coordinator requesting restore of the file to .lustre/fid/XXXX with group lock id and extents 0-EOF. (Extents may be used in the future to (a) copy in part of a file, in low-disk-space situations; (b) copy in individual stripes simultaneously on multiple OSTs.)
 * 2) Coordinator distributes that request to an appropriate agent.
 * 3) Agent starts copytool
 * 4) Copytool opens .lustre/fid/
 * 5) Copytool takes group extents lock
 * 6) Copytool copies data from HSM, reporting progress via ioctl
 * 7) When finished, copytool reports progress of 0-EOF and closes the file, releasing group extents lock.
 * 8) MDT clears hsm_released bit
 * 9) MDT releases group extents lock
 * This sends a completion AST to the original client, which now receives its extent lock.
 * 1) MDT adds FID HSM_copyin_complete record to changelog (flags: failed)
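The key property of the sequence above is that the client's open returns before the data is back, while I/O stays blocked until the group lock is dropped. A toy state model can make that ordering explicit. The `ReleasedFile` class and its method names are hypothetical, and the failure path (the "failed" flag) is deliberately not modelled.

```python
class ReleasedFile:
    """Toy model of MDT-side state for a released file (V1 restore on open)."""

    def __init__(self) -> None:
        self.hsm_released = True
        self.layout_version = 1
        self.group_lock_held = False
        self.changelog = []

    def open(self) -> int:
        """Client open: returns the layout version the client will use."""
        if self.hsm_released and not self.group_lock_held:
            # Under the PW layout lock: new layout, new objects, new version.
            self.layout_version += 1
            # Group write lock on extents 0-EOF, then PW layout lock dropped.
            self.group_lock_held = True
        return self.layout_version

    def can_do_io(self) -> bool:
        # Client extent locks are only granted once the group lock is gone.
        return not self.group_lock_held

    def restore_done(self) -> None:
        """Copytool reported 0-EOF and closed the file."""
        self.hsm_released = False
        self.group_lock_held = False
        self.changelog.append("HSM_copyin_complete")


f = ReleasedFile()
version = f.open()            # open succeeds immediately
blocked = not f.can_do_io()   # but reads and writes wait for the restore
f.restore_done()
```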



Version 2 ("complex"): "Migration on first I/O" policy
Clients are able to read/write the file data as soon as possible and the OSTs need to prevent access to the parts of the file which have not yet been restored.
 * 1) getattr: attributes can be returned from MDT with no HSM involvement
 * 2) MDS holds file size[*]
 * 3) client may get MDS attribute read locks, but not layout lock


 * 1) Client open intent enqueues layout read lock.
 * 2) MDT checks "purged" bit
 * 3) MDT creates a new layout with a similar stripe pattern as the original, allocating new objects on new OSTs with per-object "purged" bits set.
 * 4) MDT grants layout lock to client and open completes
 * 5) ?Should we pre-stage:  MDT sends request to coordinator requesting copyin of the file to .lustre/fid/XXXX with extents 0-EOF.
 * 6) client enqueues extent lock on OST. Must wait forever.
 * 7) check OST object is marked fully/partly invalid
 * 8) object may have persistent invalid map of extent(s) that indicate which parts of object require copy-in
 * 9) access to invalid parts of object triggers copy-in upcall to coordinator for those extents
 * 10) coordinator consolidates repeat requests for the same range (e.g. if entire file has already been queued for copyin, ignore specific range requests??)
 * 11) ? group locks on invalid part of file block writes to missing data
 * 12) clients block waiting on extent locks for invalid parts of objects
 * 13) OST crash at this time will restart enqueue process during replay
 * 14) coordinator contacts agent(s) to retrieve FID N extents X-Y from HSM
 * 15) copytool writes to actual object to be restored with "clear invalid" flag (special write)
 * 16) writes by agent shrink invalid extent, periodically update on-disk invalid extent and release locks on that part of file (on commit?)
 * 17) note changing lock extents (lock conversion) is not currently possible but is a long-term Lustre performance improvement goal.
 * 18) client is granted extent lock when that part of file is copied in
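The per-object invalid map described above is essentially a list of extents that shrinks as the copytool writes data. A minimal sketch, assuming half-open byte ranges `[start, end)` and a sorted, non-overlapping list (both are choices made for this example, not something the design fixes):

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # half-open byte range [start, end)


def clear_invalid(invalid: List[Extent], start: int, end: int) -> List[Extent]:
    """Return the invalid map after a "clear invalid" write of [start, end)."""
    result: List[Extent] = []
    for lo, hi in invalid:
        if end <= lo or start >= hi:
            result.append((lo, hi))      # untouched by this write
            continue
        if lo < start:
            result.append((lo, start))   # part before the write stays invalid
        if end < hi:
            result.append((end, hi))     # part after the write stays invalid
    return result


def lock_can_be_granted(invalid: List[Extent], start: int, end: int) -> bool:
    """A client extent lock is grantable once no invalid extent overlaps it."""
    return all(end <= lo or start >= hi for lo, hi in invalid)
```

Starting from `[(0, 100)]`, a write of `[0, 40)` leaves `[(40, 100)]`, so a client lock on `[0, 40)` can be granted while a lock on `[30, 50)` must keep waiting.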

copyout

 * 1) Policy engine (or administrator) decides to copy a file to HSM, executes HSMCopyOut ioctl on file
 * 2) ioctl caught by MDT, which passes request to Coordinator
 * 3) coordinator dispatches request to mover.  Request includes file extents (for future purposes)
 * 4) normal extents read lock is taken by mover running on client
 * 5) mover sends "copyout begin" message to coordinator via ioctl on the file
 * 6) coordinator/MDT sets "hsm_exists" bit and clears "hsm_dirty" bit.
 * "hsm_exists" bit is never cleared, and indicates a copy (maybe partial/out of date) exists in the HSM
 * 1) any writes to the file cause the MDT to set the "hsm_dirty" bit (may be lazy/delayed with mtime or filesize change updates on MDT for V1).
 * 2) file writes need not cancel copyout (settable via policy?  Implementation in V2.)
 * 3) mover sends status updates to coordinator via periodic ioctl calls on the file (e.g. % complete)
 * 4) mover sends "copyout done" message to coordinator via ioctl
 * 5) coordinator/MDT checks hsm_dirty bit.
 * 6) If not dirty, MDT sets "copyout_complete" bit.
 * 7) If dirty, coordinator dispatches another copyout request; go back to the dispatch step (step 3 at the top of this list)
 * 8) MDT adds FID X HSM_copyout_complete record to changelog
 * 9) Policy engine notes HSM_copyout_complete record from changelog (flags: failed, dirty)

(Note: a file modified after copyout is complete will have both the copyout_complete and hsm_dirty bits set.)
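The dirty check at the end of a copyout amounts to a retry loop. The sketch below models it with a plain dictionary of bits; `run_copyout`, the `copy_once` callback and the `max_attempts` bound are assumptions added for the example (the steps above do not specify a retry limit).

```python
from typing import Callable, Dict, Optional


def run_copyout(state: Dict[str, bool],
                copy_once: Callable[[Dict[str, bool]], None],
                max_attempts: int = 3) -> Optional[int]:
    """Repeat the copyout until it finishes with hsm_dirty still clear.

    Returns the number of attempts used, or None if every attempt was
    dirtied by concurrent writes.
    """
    for attempt in range(1, max_attempts + 1):
        # "copyout begin": hsm_exists is set (never cleared), hsm_dirty cleared.
        state["hsm_exists"] = True
        state["hsm_dirty"] = False
        copy_once(state)          # writes during the copy set hsm_dirty
        # "copyout done": check whether the file changed underneath us.
        if not state["hsm_dirty"]:
            state["copyout_complete"] = True
            return attempt
    return None
```

A caller would pass a `copy_once` that performs the actual data movement; any write to the file while it runs is modelled by setting `state["hsm_dirty"] = True`.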



V1: full file purge

 * 1) Policy engine (or administrator) decides to purge a file, executes HSMPurge ioctl on file
 * 2) ioctl handled by MDT
 * 3) MDT takes a write lock on the file layout lock
 * 4) MDT enqueues write locks on all extents of the file.  Once these are granted, no client has any dirty cache and no client can take new extent locks until the layout lock is released.  MDT then drops all extent locks.
 * 5) MDT verifies that hsm_dirty bit is clear and copyout_complete bit is set
 * 6) if not, the file cannot be purged, return EPERM
 * 7) MDT marks the LOV EA as "purged"
 * 8) MDT sends destroys to the OST objects, using destroy llog entries to guard against object leakage during OST failover
 * 9) the OSTs should eventually purge the objects during orphan recovery
 * 10) MDT drops layout lock.
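The precondition check in the purge path is small enough to show directly. This is a sketch only: `try_purge` is a hypothetical name, the bits are modelled as dictionary entries, and the locking and object destruction steps are left out.

```python
import errno
from typing import Dict


def try_purge(flags: Dict[str, bool]) -> int:
    """Mark a file as purged if its HSM copy is complete and current.

    Returns 0 on success or -EPERM if the file cannot be purged.
    """
    if flags.get("hsm_dirty") or not flags.get("copyout_complete"):
        return -errno.EPERM
    flags["purged"] = True    # stands in for marking the LOV EA as "purged"
    return 0
```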

V2: partial purge
Partially purged files should allow graphical file browsers to retrieve file header info or icons stored at the beginning or end of a file. Note: determine exactly which parts of a file Windows Explorer reads to generate its icons.
 * 1) MDT sends purge range to first and last objects, and destroys to all intermediate objects, using llog entries for recovery.
 * 2) First and last OSTs record purge range
 * 3) When requesting copyin of the entire file (first access to the middle of a partially purged file), MDT destroys old partial objects before allocating new layout. (Or: we keep old first and last objects, just allocate new "middle object" striping - yuck.)

unlink

 * 1) A client issues an unlink for a file to the MDT.
 * 2) The MDT includes the "hsm_exists" bit in the changelog unlink entry
 * 3) The policy engine determines if the file should be removed from HSM
 * 4) Policy engine sends HSMunlink FID to coordinator via MDT ioctl
 * 5) ioctl will be on the directory .lustre/fid
 * or perhaps on a new .lustre/dev/XXX where any lustre device may be listed, and act as stub files for handling ioctls.
 * 1) The coordinator sends a request to one of its agents for the corresponding removal.
 * 2) The agent spawns the HSM tool to do this removal.
 * 3) HSM tool reports completion via another MDT ioctl
 * 4) Coordinator cancels unlink request record
 * 5) In case of agent crash, unlink request will remain uncancelled and coordinator will eventually requeue
 * 6) In case of coordinator crash, agent ioctl will proceed after recovery
 * 7) Policy engine notes HSM_unlink_complete record from changelog (flags: failed)

abort

 * 1) abort dead agent
 * the coordinator must send an abort signal to an agent to abort a copyout/copyin if it determines the migration is stuck/crashed. The coordinator can then re-queue the migration request elsewhere.
 * 1) dirty-while-copyout
 * If a file is written to while it is being copied out, the copyout will have an incoherent copy in some cases.
 * 1) We could send abort signal, but:
 * 2) If a filesystem has a single massive file that is used all the time, it will never get backed up if we abort.
 * 3) Not a problem if just appending to a file
 * 4) Most backup systems work this way with relatively little harm.
 * 5) V1: don't abort this case
 * 6) V2: abort in this case is a settable policy

MDT crash

 * 1) MDT crashes and is restarted.
 * 2) The coordinator recreates its migration list by reading its llog.
 * 3) The client, when doing its recovery with the MDT, reconnects to the coordinator.
 * 4) Copytool eventually sends its periodic status update for migrating files (asynchronously from reconnect).
 * 5) As far as the copytool and agent are concerned, the MDT restart is invisible.

Note: The migration list is simply the list of unfinished migrations which may be read from the llog at any time (no need to keep it in memory all the time, if there are many open migration requests).

Logs should contain:
 * 1) fid, request type, agent_id (for aborts)
 * 2) if the list is not kept in memory: last_status_update_time, last_status.
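Rebuilding the migration list after an MDT restart is a replay of the log: every request that was added and not later cancelled is still open. A minimal sketch, assuming a hypothetical record format with `op`, `fid`, `type` and `agent_id` keys (the real llog record layout is not described here):

```python
from typing import Dict, Iterable, Optional, Tuple


def rebuild_migration_list(
        records: Iterable[dict]) -> Dict[Tuple[str, str], Optional[str]]:
    """Replay log records and return {(fid, request type): agent_id}."""
    open_requests: Dict[Tuple[str, str], Optional[str]] = {}
    for rec in records:
        key = (rec["fid"], rec["type"])
        if rec["op"] == "add":
            open_requests[key] = rec.get("agent_id")
        elif rec["op"] == "cancel":
            # Cancelled on completion or abort: no longer an open migration.
            open_requests.pop(key, None)
    return open_requests
```

Because the result depends only on the log contents, the list can be recomputed on demand rather than kept in memory.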

Client crash
 * 1) Client stops communicating with MDT
 * 2) MDT evicts client
 * 3) Eviction triggers coordinator to re-dispatch immediately all of the migrations from that agent
 * 4) For copyin, it is desirable that any existing agent I/O is stopped
 * 5) Ghost client and copytool may still be alive and communicating with OSTs, but not MDT.  Can't send abort.
 * 6) Taking file extent locks will only temporarily stop ghost.
 * 7) It's not so bad if new agent and ghost are racing trying to copyin the file at the same time.
 * 8) Regular extent locks prevent file corruption
 * 9) The file data being copied in is the same
 * 10) Ghost copyin may still be ongoing after new copyin has finished, in which case ghost may overwrite newly-modified data (data modified by regular clients after HSM/Lustre think copyin is complete).

Copytool crash
Copytool crash is different from a client crash, since the client will not get evicted
 * 1) Copytool crashes
 * 2) Coordinator notices no status updates
 * 3) Coordinator sends abort signal to old agent
 * 4) Coordinator re-dispatches migration
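The detection step relies on missing status updates. A minimal sketch of that check, where `handle_stalled`, the dictionaries and the timeout value are assumptions for illustration and the returned tuples stand in for the abort and dispatch messages:

```python
from typing import Dict, List, Tuple


def handle_stalled(last_update: Dict[str, float],
                   assignments: Dict[str, str],
                   agents: List[str],
                   now: float,
                   timeout: float) -> List[Tuple[str, str, str]]:
    """Abort and re-dispatch every migration with no recent status update."""
    actions: List[Tuple[str, str, str]] = []
    for fid, seen in sorted(last_update.items()):
        if now - seen <= timeout:
            continue
        old = assignments[fid]
        # Prefer a different agent; fall back to the same one if it is alone.
        new = next((a for a in agents if a != old), old)
        actions.append(("abort", old, fid))
        actions.append(("dispatch", new, fid))
        assignments[fid] = new
        last_update[fid] = now    # restart the clock for the new attempt
    return actions
```

Since the client is not evicted in this case, the abort can actually be delivered to the old agent, unlike in the client crash scenario.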

Implementation constraints

 * 1) all single-file coherency issues are in kernel space (file locking, recovery)
 * 2) all policy decisions are in user space (using changelogs, df, etc)
 * 3) coordinator/mover communication will use LNET
 * 4) Version 1 HSM is a simplified implementation:
 * 5) integration with HPSS only
 * 6) depends on changelog for policy decisions
 * 7) restore on file open, not data read/write
 * 8) HSM tracks entire files, not stripe objects
 * 9) HSM namespace is flat, all files are addressed by FID only
 * 10) Coordinator and movers can be reused by (non-HSM) replication

HSM Migration components & interactions
Note: for V1, copyin initiators are on MDT only (file open).

For further review/detail

 * 1) "complex" HSM roadmap
 * 2) partial access to files during restore
 * 3) partial purging for file type identification, image thumbnails, ??
 * 4) integration with other HSM backends (ADM, ??)
 * 5) How can layout locks be held in liblustre

References

 * HSM implementation: 15599
 * changelogs: 15699