RAID5 Patches

= Notes about RAID5 Internals =

== Structures ==
In Linux, RAID5 handles all incoming requests in small units called stripes. A stripe is a set of blocks taken from all disks at the same position. A block is defined as a unit of PAGE_SIZE bytes.

For example, suppose you have 3 disks and have specified an 8K chunksize. With 4K blocks, each chunk holds 2 blocks, and internally RAID5 will look like this (which disk holds Pn depends on the parity layout, since parity floats among disks):

           disk0   disk1   disk2
    S0:     #0      #16     P0
    S8:     #8      #24     P8

where:


 * Sn -- internal stripe number
 * #n -- an offset in sectors (512 bytes each)
 * Pn -- parity for the other blocks in the stripe (its position actually floats among disks)

As you can see, an 8K chunksize means each chunk contains 2 contiguous blocks.

== Logic ==
make_request goes through an incoming request, breaking it into blocks (PAGE_SIZE) and handling them separately. Given a bio with bi_sector = 0 and bi_size = 24K on the array described above, make_request would handle #0, #8, and #16.

For every block, add_stripe_bio and handle_stripe are called.

The intention of add_stripe_bio is to attach a bio to a given stripe. Later, in handle_stripe, the bio and its data can be used to serve requests.

handle_stripe is the core of RAID5 (as discussed in the next section).

== handle_stripe ==
The routine works with a stripe. It checks what should be done, learns the current state of a stripe in the internal cache, decides what I/O is needed to satisfy user requests, and does recovery.

For example, if a user wants to write block #0 (8 sectors starting from sector 0), RAID5's responsibility is to store new data and update parity P0. There are a few possibilities here:


 * 1) delay serving until the data for block #16 is ready -- the user will probably want to write #16 very soon?
 * 2) read #16, compute a new parity P0; write #0 and P0.
 * 3) read P0, subtract the old #0 from P0 (so it looks as if the parity was computed without #0), and re-compute the parity with the new #0.

The first possibility looks like the best option because it does not require a very expensive read, but the problem is that the user may need to write only #0, and not #16, in the near future.

Also, the queue can get unplugged, meaning that the user wants all requests to complete. (Unfortunately, in the current block layer, there is no way to specify the exact request that the user is interested in, so any completion interest means immediate serving of the entire queue).

== Problems ==
This is a short list of RAID5 problems that we encountered in the Thumper project:

 * Order of handling is not good for large requests

As handle_stripe goes in logical block order, it handles S0, then S8, and then S0 and S8 again. After the first touch, S0 is left with block #0 up-to-date, while #16 and P0 are not. Thus, if the stripe is forced to complete, we would need to read block #16 or P0 to get a fully up-to-date stripe. Such reads hurt throughput almost to death. If just a single process writes, then things are OK, because nobody unplugs the queue and there are no requests to force completion of a pending request. But, if there are more writers, then queue unplugs occur often, and pending requests are often forced to complete. Take into account that, in reality, we use a large chunk size (128K, 256K, or even larger). Hence, in the end, there are many out-of-date stripes in the cache and many reads.

 * memcpy is a top consumer

All requests go via the internal cache. On a dual-core, two-way Opteron, this takes up to 30-33% of the CPU when doing 1 GB/s writes.

 * Small requests

To fill I/O pipes and reach good throughput, we need very large I/O requests. Lustre does this by using the bio subsystem on 2.6. But, as described above, RAID5 handles all blocks separately and issues a separate I/O (bio) for every block. This is partially solved by the I/O scheduler, which merges small requests into bigger ones. But, due to the nature of the block subsystem, any process that wants its I/O completed unplugs the queue, and we can end up with many small requests in the pipe.

We have developed patches that address the described problems. You can find them at ftp://ftp.clusterfs.com/pub/people/alex/raid5

== Zero-copy Patch ==
In the current RAID5 implementation, there is a cache for each device involved. For each instance of I/O, the RAID5 driver updates the device cache first, and then submits the read/write request to the real block device from there. The operation to update the cache consumes lots of CPU time copying data from/to the device cache. For Lustre's bulk writes, it is pointless to update this cache at all. This is the basis of the zero-copy patch for RAID5.

To avoid the data copy, the pages from the bio have to be used directly to calculate the parity page and to feed the I/O to the disks. These pages (bio pages from the filesystem layer) must not be modified while the parity page is being calculated. Otherwise, wrong parity will be written to storage, which causes garbage data to be generated if the administrator tries to rebuild the array in the future.

Our solution is to lock the pages and then unmap them, if they are mapped (to keep them from being modified). For this purpose, an additional page flag (PG_constant) has been introduced. The page is locked and then unmapped; then this bit is set to let the RAID5 driver know that the page will not be modified during I/O. After the I/O against the page is finished, the bit is cleared so the page can be written again.
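The intended flow looks roughly like the following non-runnable sketch. The Set/Clear helpers are the naming one would conventionally derive from a new PG_constant flag; this is an illustration of the patch's idea, not its literal code:

```
/* Before handing the bio page to RAID5 (e.g. in the filesystem): */
lock_page(page);                /* serialize against other users of the page */
/* ... unmap the page from user space if it is mapped, so the data
 *     cannot change underneath the parity calculation ... */
SetPageConstant(page);          /* tell RAID5: safe to use this page in place */

/* In handle_stripe: if the page is marked constant, XOR the bio page
 * directly into the parity block and point the outgoing bio at it --
 * no memcpy into the stripe cache. */

/* On I/O completion: */
ClearPageConstant(page);        /* the page may be modified again */
unlock_page(page);
```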