Architecture - Backup
Note: The content on this page reflects the state of design of a Lustre feature at a particular point in time and may contain outdated information.
The architecture described here enables:
- full and incremental backups
- reasonable scalability: file systems holding 100M files and 100TB of total data, with about 10% changing daily, should be covered
- parallel data movement to/from the file system
- if possible, rename awareness
Decomposition of the design
The architecture decomposes into several pieces:
- a file system scanner which can find changed inodes and their parent directories
- a database which is used in conjunction with the scanner to construct a changelog from the results of the scan and from the previous snapshot of the file system
- a synchronization program which processes the records of the changelog to perform the backup
Use cases to cover
|deployment||scanning is done on the MDS|
|performance||scanning will be significantly faster than running a "find" process, even on a local file system, and will use a significant percentage of the available disk bandwidth.|
|constraint||the database recording results of the scan and snapshots of the namespace shall be an SQL database|
|performance||deletions and renames shall be detected by combining the results from the scan with the database content|
|changelog||the changelog produced for backup synchronization shall be in a specified format, usable for many synchronization purposes|
|performance||the changelog will allow parallel processing of backup of file data by distributing such work over multiple systems|
|constraint||the solution can integrate with the Lustre version of GNU tar (gtar) to restore Lustre striping patterns correctly.|
|feature||the database can contain file sizes|
State and State Changes
- State - at time T_N the system will consist of
  - A database DB_N with directory entries (ino, gen, pino, pgen?, name) representing approximately the namespace of the file system FS_N to be backed up. The word approximately is used to draw attention to the fact that the file system is changing during the backup operation.
  - A backup BU_N of the file tree
  - A timestamp T_N - all changes made to the namespace prior to T_N are represented in the database
  - A collection of changes between FS_(N-1) and FS_N consisting of:
    - deleted files
    - modified files
    - created files
    - renamed files
- State changing algorithm - move state from DB_N to DB_(N+1) and BU_N to BU_(N+1)
  - Scanning Phase - find changed and deleted inodes
  - Update the database - find re-used inodes (deletion followed by a creation) and renamed inodes. Use directory information in FS_(N+1) and DB_N to create DB_(N+1)
  - Produce the change lists
  - Synchronization algorithm - using the lists of changes, change the BU_N to BU_(N+1).
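The database update step above can be illustrated with a small SQL sketch. It is only a sketch under stated assumptions, not the design's actual schema: the table names `db_prev` and `scan_now`, the use of SQLite, the omission of `pgen`, and the simplification that each inode has exactly one name are all hypothetical. It also assumes `scan_now` holds a complete set of directory entries for FS_(N+1). Because entries are matched on (ino, gen), a re-used inode shows up as one deletion plus one creation, and an entry whose parent or name changed shows up as a rename. Modified files would come directly from the scan and are not derived here.

```python
import sqlite3

# Hypothetical schema: one row per directory entry, keyed by (ino, gen).
SCHEMA = """
CREATE TABLE db_prev  (ino INTEGER, gen INTEGER, pino INTEGER, name TEXT,
                       PRIMARY KEY (ino, gen));
CREATE TABLE scan_now (ino INTEGER, gen INTEGER, pino INTEGER, name TEXT,
                       PRIMARY KEY (ino, gen));
"""

def open_db():
    """Create an in-memory database with the two hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def load(conn, table, rows):
    """Insert (ino, gen, pino, name) rows into one of the two fixed tables."""
    if table not in ("db_prev", "scan_now"):
        raise ValueError("unknown table")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)

def change_lists(conn):
    """Derive created, deleted and renamed entries by joining on (ino, gen)."""
    created = conn.execute("""
        SELECT s.ino, s.gen, s.pino, s.name
        FROM scan_now s LEFT JOIN db_prev p
          ON p.ino = s.ino AND p.gen = s.gen
        WHERE p.ino IS NULL
        ORDER BY s.ino, s.gen""").fetchall()
    deleted = conn.execute("""
        SELECT p.ino, p.gen, p.pino, p.name
        FROM db_prev p LEFT JOIN scan_now s
          ON p.ino = s.ino AND p.gen = s.gen
        WHERE s.ino IS NULL
        ORDER BY p.ino, p.gen""").fetchall()
    renamed = conn.execute("""
        SELECT p.ino, p.gen, p.pino, p.name, s.pino, s.name
        FROM db_prev p JOIN scan_now s
          ON p.ino = s.ino AND p.gen = s.gen
        WHERE p.pino <> s.pino OR p.name <> s.name
        ORDER BY p.ino, p.gen""").fetchall()
    return {"created": created, "deleted": deleted, "renamed": renamed}

conn = open_db()
load(conn, "db_prev",  [(10, 1, 2, "a.txt"), (11, 1, 2, "b.txt"), (12, 1, 2, "c.txt")])
# Inode 11 moved to directory 3; inode 12 was deleted and re-used (gen 1 -> 2).
load(conn, "scan_now", [(10, 1, 2, "a.txt"), (11, 1, 3, "b.txt"), (12, 2, 2, "d.txt")])
lists = change_lists(conn)
```

In this example `lists["renamed"]` contains the single entry for inode 11, while the re-used inode 12 appears once in `lists["deleted"]` (generation 1) and once in `lists["created"]` (generation 2).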
Suggestions for algorithms
Initial scanning should read inodes in chunks of several MB at a time. From I/O studies we know that this can be done at raw disk speeds, which allows roughly 200K inodes/sec to be processed for each 100MB/sec of disk bandwidth when the inode size is 512B. This process can write out a table of modified and deleted inodes. If the number of changed inodes is relatively small (approximately 10% of all inodes in use), it should be possible to create this table on a similar storage device without falling behind the reading.
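The scan rate quoted above follows from simple arithmetic, which the short sketch below reproduces. The function names are hypothetical, and 1MB is taken as 10^6 bytes, so the result is slightly below the rounded 200K figure.

```python
def inodes_per_second(bandwidth_bytes_per_s, inode_size=512):
    """Inodes read per second when streaming the inode tables at disk speed."""
    return bandwidth_bytes_per_s / inode_size

def scan_seconds(num_inodes, bandwidth_bytes_per_s, inode_size=512):
    """Time to stream the on-disk inodes of num_inodes files once."""
    return num_inodes * inode_size / bandwidth_bytes_per_s

MB = 1_000_000  # assumption: decimal megabytes

rate = inodes_per_second(100 * MB)             # 195312.5 inodes/sec
duration = scan_seconds(100_000_000, 100 * MB)  # 512.0 seconds for 100M inodes
```

Under these assumptions, a file system with 100M inodes would be scanned in under ten minutes per 100MB/sec of available bandwidth, ignoring seeks and any directory reads.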
Reading the directory data in FS_N is a concern, because these reads are likely to show poor I/O performance.
The synchronization algorithm must operate on BU, which does not have the file identifiers (ino, gen) that are found in FS. As a result, the operations must be specified using pathnames. A suggestion has been made to order the operations by sorting their pathnames lexicographically. Renames would be inserted in the sequence after the creations, updates and deletions for shorter pathnames have been inserted, considering both the source and destination pathnames of the rename operation. The correctness of this algorithm still needs to be proven.
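One possible reading of this ordering suggestion is sketched below. It is an interpretation, not the specified algorithm: the `Op` record, the operation names, and the choice to place a rename at the lexicographically later of its source and destination pathnames are all assumptions made for illustration, and the sketch does nothing to settle the open correctness question.

```python
from typing import NamedTuple, Optional

class Op(NamedTuple):
    """Hypothetical changelog record expressed in pathnames."""
    kind: str                    # "create", "update", "delete" or "rename"
    path: str                    # target path, or source path for a rename
    dest: Optional[str] = None   # destination path, renames only

def sort_key(op):
    # A rename is keyed by the later of its two pathnames, so that operations
    # on the shorter (parent) pathnames of both ends sort before it.
    if op.kind == "rename":
        return (max(op.path, op.dest), True)
    # On equal pathnames, non-rename operations sort before renames.
    return (op.path, False)

def order_ops(ops):
    """Return the operations sorted lexicographically by pathname."""
    return sorted(ops, key=sort_key)

ops = [
    Op("rename", "/a/x", "/b/y"),
    Op("create", "/b"),
    Op("update", "/a/z"),
    Op("delete", "/a/old"),
]
ordered = [(op.kind, op.path) for op in order_ops(ops)]
# ordered == [("delete", "/a/old"), ("update", "/a/z"),
#             ("create", "/b"), ("rename", "/a/x")]
```

In this example the creation of `/b` is ordered before the rename into `/b/y`, because `/b` is a prefix of the destination and therefore sorts earlier.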
Synchronization should store the striping attributes of the source file in an extended attribute (EA) associated with the destination file.
File data synchronization must be distributed over multiple client nodes for sufficient bandwidth. While the file data is being synchronized, the updated file sizes can be stored in the database.
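As an illustration of how such work might be spread over client nodes, the sketch below assigns files to clients with a greedy largest-first heuristic. This is an assumed strategy, not one named in the design: the function name is hypothetical, and it presumes that a size estimate for each file is available, for example from the file sizes recorded in the database.

```python
import heapq

def partition_by_size(files, num_clients):
    """Assign (path, size) pairs to clients, largest file first,
    always giving the next file to the least-loaded client."""
    heap = [(0, i) for i in range(num_clients)]  # (bytes assigned, client index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_clients)]
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, i = heapq.heappop(heap)
        buckets[i].append(path)
        heapq.heappush(heap, (load + size, i))
    return buckets

files = [("/a", 8), ("/b", 7), ("/c", 6), ("/d", 5), ("/e", 4)]
buckets = partition_by_size(files, 2)
# buckets == [["/a", "/d", "/e"], ["/b", "/c"]]   (17 and 13 units of data)
```

The heuristic keeps the per-client totals close without requiring any coordination between clients, but it ignores stripe placement and network topology, both of which could matter in practice.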