WARNING: This is the _old_ Lustre wiki, and it is in the process of being retired. The information found here is all likely to be out of date. Please search the new wiki for more up to date information.

RAID5 Patches

From Obsolete Lustre Wiki
Revision as of 23:11, 29 April 2007 by Yep (talk | contribs) (→‎Structures)

Notes about RAID5 internals

Structures

In Linux, RAID5 handles all incoming requests in small units called stripes. A stripe is a set of blocks taken from all disks at the same position. A block is defined as a unit of PAGE_SIZE bytes.

For example, suppose you have 3 disks and specify an 8K chunk size. Internally, RAID5 then looks like this:

        S0    S8    S32   S40
Disk1   #0    #8    #32   #40
Disk2   #16   #24   #48   #56
Disk3   P0    P8    P32   P40

where:

 * Sn -- the number of an internal stripe
 * #n -- an offset in sectors (512 bytes each)
 * Pn -- parity for the other blocks in the stripe (in reality, parity floats among the disks)

As you can see, an 8K chunk size means 2 contiguous blocks (with a 4K PAGE_SIZE).
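The layout above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the kernel's code: the names `locate` and `stripe_label` are hypothetical, PAGE_SIZE is assumed to be 4K, and parity is ignored (the table pins it to Disk3, while in reality it floats among the disks).

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096          # assumption: 4K pages
CHUNK_SIZE = 8192         # 8K chunk size from the example
DATA_DISKS = 2            # 3 disks, one block per stripe holds parity

SECTORS_PER_BLOCK = PAGE_SIZE // SECTOR_SIZE    # 8
SECTORS_PER_CHUNK = CHUNK_SIZE // SECTOR_SIZE   # 16


def locate(sector):
    """Map a logical sector to (data disk index, sector on that disk)."""
    chunk_number, offset_in_chunk = divmod(sector, SECTORS_PER_CHUNK)
    row, data_disk = divmod(chunk_number, DATA_DISKS)
    return data_disk, row * SECTORS_PER_CHUNK + offset_in_chunk


def stripe_label(sector):
    """Return n for the stripe Sn (as labelled in the table) holding `sector`."""
    _, sector_on_disk = locate(sector)
    row, offset_in_chunk = divmod(sector_on_disk, SECTORS_PER_CHUNK)
    # Round down to the start of the PAGE_SIZE block inside the chunk.
    block_offset = offset_in_chunk - offset_in_chunk % SECTORS_PER_BLOCK
    return row * SECTORS_PER_CHUNK * DATA_DISKS + block_offset


if __name__ == "__main__":
    for sector in range(0, 64, SECTORS_PER_BLOCK):
        disk, _ = locate(sector)
        print(f"#{sector}: Disk{disk + 1}, stripe S{stripe_label(sector)}")
```

The stripe label is simply the logical offset of the first data block in the stripe, which is why the table jumps from S8 to S32.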

Logic

make_request() goes through an incoming request, breaking it into blocks (PAGE_SIZE) and handling them separately. Given a bio with bi_sector = 0 and bi_size = 24K and the array described above, make_request() would handle blocks #0, #8, #16 and so on.

For every block, add_stripe_bio() and handle_stripe() are called.
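The splitting step can be sketched as follows. This is an illustration only: `split_into_blocks` is a hypothetical helper, not a kernel function, and PAGE_SIZE is assumed to be 4K. It returns the starting sector of every block that a bio touches, which is the list of blocks for which add_stripe_bio() and handle_stripe() would be called.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096                               # assumption: 4K pages
SECTORS_PER_BLOCK = PAGE_SIZE // SECTOR_SIZE   # 8


def split_into_blocks(bi_sector, bi_size):
    """Return the starting sectors of all PAGE_SIZE blocks touched by a bio.

    bi_sector is the first sector of the request, bi_size its length in bytes.
    """
    # Align the first block down to a block boundary.
    first = bi_sector - bi_sector % SECTORS_PER_BLOCK
    # One past the last sector covered by the request (rounded up).
    end = bi_sector + (bi_size + SECTOR_SIZE - 1) // SECTOR_SIZE
    return list(range(first, end, SECTORS_PER_BLOCK))


if __name__ == "__main__":
    blocks = split_into_blocks(0, 24 * 1024)
    print(" ".join(f"#{b}" for b in blocks))
```

Under the 4K assumption, a 24K bio starting at sector 0 covers six blocks, #0 through #40.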

The intention of add_stripe_bio() is to attach the bio to the given stripe, so that later handle_stripe() can use the bio and its data to serve requests.

handle_stripe() is the core of raid5; we'll discuss it in the next part.

handle_stripe()

The routine works on one stripe. It checks what should be done, learns the current state of the stripe in the internal cache, decides what I/O is needed to satisfy user requests, and performs recovery.

Say the user wants to write block #0 (8 sectors starting from sector 0). raid5's responsibility is to store the new data and update parity P0. There are a few possibilities here:

1. delay serving until data for block #16 is ready -- perhaps the user will want to write #16 very soon
2. read #16, compute a new parity P0, then write #0 and P0
3. read P0, roll the old #0 back out of P0 (so it looks as if parity had been computed without #0), then recompute parity with the new #0

The 1st way looks best because it doesn't require a very expensive read, but the problem is that the user may need to write only #0 and not #16 in the near future. Also, the queue can get unplugged, meaning that the user wants all requests to complete (unfortunately, in the current block layer there is no way to specify which exact request the user is interested in, so any interest in completion means immediate serving of the whole queue).
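Options 2 and 3 above lead to the same parity, which a small XOR example can show. This is a sketch under stated assumptions: blocks are shortened to 8 bytes, the helper `xor_blocks` is hypothetical, and option 3 assumes the old contents of #0 are available (from the cache or from a read) in addition to P0.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# Current on-disk state of stripe S0 (toy 8-byte blocks).
old_block0 = bytes([0x11] * 8)
block16 = bytes([0x22] * 8)
old_parity = xor_blocks(old_block0, block16)

# The user writes new data to block #0.
new_block0 = bytes([0x5A] * 8)

# Option 2: read #16 and compute parity from all data blocks.
parity_from_data = xor_blocks(new_block0, block16)

# Option 3: read P0, XOR the old #0 out of it, XOR the new #0 in.
parity_from_rollback = xor_blocks(old_parity, old_block0, new_block0)

# Both routes must produce the same parity block.
assert parity_from_data == parity_from_rollback
```

Either route costs a read before the write can be issued; option 1 avoids the read only if the rest of the stripe actually arrives.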

Problems

A short list of the problems in raid5 that we met in the Thumper project:

* order of handling isn't good for large requests
  As handle_stripe() goes in logical block order, it
  handles S0, then S8, then S0 and S8 again. After the first touch
  S0 is left with block #0 uptodate, while #16 and P0 are not. Thus
  if the stripe is forced to completion, we need to read block
  #16 or P0 to get a fully uptodate stripe. Such reads hurt throughput
  almost to death. If just a single process writes, then things are
  OK, because nobody unplugs the queue and there are no requests that
  force completion of a pending request. But the more writers there
  are, the more often the queue is unplugged and the more often pending
  requests are forced to completion. Take into account that in reality
  we use large chunk sizes (128K, 256K and even larger), hence tons of
  non-uptodate stripes in the cache and tons of reads in the end.
* memcpy() is the top CPU consumer
  All requests go via the internal cache. On a dual-core 2-way Opteron
  it takes up to 30-33% of CPU when writing at 1GB/s.
* small requests
  To fill the I/O pipes and reach good throughput we need quite large
  I/O requests. Lustre does this using the bio subsystem on 2.6. But,
  as mentioned, raid5 handles all blocks separately and issues a
  separate I/O (bio) for every block. This is partially solved by the
  I/O scheduler, which merges small requests into bigger ones, but
  due to the nature of the block subsystem, any process that wants its
  I/O completed unplugs the queue, and we can get many small requests
  in the pipe.
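The first problem in the list can be made concrete with a small model. This is a rough sketch, not a measurement: `stripe_label`, `visit_order` and `partial_stripes` are hypothetical helpers, PAGE_SIZE is assumed to be 4K, and the model only counts stripes that hold some but not all of their data blocks at the moment the queue is unplugged.

```python
from collections import Counter

SECTOR_SIZE = 512
PAGE_SIZE = 4096                               # assumption: 4K pages
SECTORS_PER_BLOCK = PAGE_SIZE // SECTOR_SIZE   # 8


def stripe_label(sector, chunk_size, data_disks):
    """Label of the stripe holding `sector` (offset of its first data block)."""
    sectors_per_chunk = chunk_size // SECTOR_SIZE
    chunk_number, offset_in_chunk = divmod(sector, sectors_per_chunk)
    row = chunk_number // data_disks
    block_offset = offset_in_chunk - offset_in_chunk % SECTORS_PER_BLOCK
    return row * sectors_per_chunk * data_disks + block_offset


def visit_order(bytes_written, chunk_size, data_disks):
    """Stripes touched, in order, when blocks are handled in logical order."""
    sectors = bytes_written // SECTOR_SIZE
    return [stripe_label(s, chunk_size, data_disks)
            for s in range(0, sectors, SECTORS_PER_BLOCK)]


def partial_stripes(order, data_disks):
    """Count stripes that have received some, but not all, data blocks."""
    filled = Counter(order)
    return sum(1 for count in filled.values() if count < data_disks)


if __name__ == "__main__":
    # 3-disk example from the Structures section: S0, S8, then S0 and S8 again.
    print(visit_order(16 * 1024, 8 * 1024, 2))
    # 128K chunks: an unplug after the first 128K leaves every touched
    # stripe incomplete, so each one needs a read before it can be written.
    print(partial_stripes(visit_order(128 * 1024, 128 * 1024, 2), 2))
```

In this model the number of incomplete stripes at an unplug grows with the chunk size, which matches the observation that large chunks lead to many reads.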

We developed patches that address the problems described above. You can find them at ftp://ftp.clusterfs.com/pub/people/alex/raid5