RAID5 Patches

Notes about RAID5 Internals

Structures

In Linux, RAID5 handles all incoming requests in small units called stripes. A stripe is a set of blocks taken from all disks at the same position. A block is a unit of PAGE_SIZE bytes.

For example, suppose you have 3 disks and have specified an 8K chunksize. Internally, the RAID5 layout looks like this:

        S0     S8     S32    S40
Disk1   #0     #8     #32    #40
Disk2   #16    #24    #48    #56
Disk3   P0     P8     P32    P40

where:

  • Sn -- the number of an internal stripe
  • #n -- an offset in sectors (512 bytes each)
  • Pn -- the parity for the other blocks in the stripe (in reality, it floats among the disks)

As you can see, an 8K chunksize means 2 contiguous blocks.
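
To make the layout concrete, here is a minimal user-space sketch that reproduces the table above. It is only an illustration: the constants, the block_map structure, and map_sector() are invented for this example, and parity is assumed to sit on the last disk, whereas the real md/raid5 code rotates it among the disks.

    /*
     * Illustrative sketch only: 3 disks, an 8K chunksize, 4K (PAGE_SIZE)
     * blocks, and parity assumed fixed on the last disk (the real code
     * rotates parity among the disks). The names are invented for this
     * example and are not the kernel's.
     */
    #include <stdio.h>

    #define SECTOR_SIZE   512
    #define BLOCK_SECTORS (4096 / SECTOR_SIZE)  /* 4K block = 8 sectors    */
    #define CHUNK_SECTORS (8192 / SECTOR_SIZE)  /* 8K chunk = 16 sectors   */
    #define DATA_DISKS    2                     /* 3 disks, 1 keeps parity */

    struct block_map {
        unsigned long stripe;     /* Sn label, as in the table above      */
        int           disk;       /* data disk holding the block (1 or 2) */
        unsigned long dev_sector; /* offset of the block on that disk     */
    };

    static struct block_map map_sector(unsigned long logical_sector)
    {
        unsigned long chunk  = logical_sector / CHUNK_SECTORS;
        unsigned long offset = logical_sector % CHUNK_SECTORS;
        unsigned long row    = chunk / DATA_DISKS; /* chunk row across the disks */
        struct block_map m;

        m.disk       = (int)(chunk % DATA_DISKS) + 1;
        m.dev_sector = row * CHUNK_SECTORS + offset;
        /* label the stripe by the logical sector of its block on Disk1 */
        m.stripe     = row * DATA_DISKS * CHUNK_SECTORS
                       + (offset / BLOCK_SECTORS) * BLOCK_SECTORS;
        return m;
    }

    int main(void)
    {
        unsigned long sectors[] = { 0, 8, 16, 24, 32, 40, 48, 56 };

        for (int i = 0; i < 8; i++) {
            struct block_map m = map_sector(sectors[i]);
            printf("#%-2lu -> stripe S%-2lu, Disk%d, device sector %lu\n",
                   sectors[i], m.stripe, m.disk, m.dev_sector);
        }
        return 0;
    }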

Logic

make_request() goes through an incoming request, breaking it into blocks (PAGE_SIZE) and handling them separately. Given a bio with bi_sector = 0 and bi_size = 24K on the array described above, make_request() would handle blocks #0, #8, #16, #24, #32, and #40.
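
A minimal sketch of just this splitting step, under the same assumptions as above; handle_block() is a made-up stand-in for the real per-block work (finding the stripe in the cache, add_stripe_bio(), handle_stripe()):

    /* Sketch of the splitting loop only; not the real make_request(). */
    #include <stdio.h>

    #define SECTOR_SIZE   512
    #define BLOCK_SECTORS (4096 / SECTOR_SIZE)  /* one PAGE_SIZE block = 8 sectors */

    /* stand-in for: find/allocate the stripe, add_stripe_bio(), handle_stripe() */
    static void handle_block(unsigned long logical_sector)
    {
        printf("handling block #%lu\n", logical_sector);
    }

    /* bi_sector and bi_size mirror the bio fields mentioned in the text */
    static void make_request_sketch(unsigned long bi_sector, unsigned long bi_size)
    {
        unsigned long nr_sectors = bi_size / SECTOR_SIZE;
        unsigned long s;

        for (s = bi_sector; s < bi_sector + nr_sectors; s += BLOCK_SECTORS)
            handle_block(s);
    }

    int main(void)
    {
        /* the example from the text: bi_sector = 0, bi_size = 24K
         * -> blocks #0, #8, #16, #24, #32 and #40 */
        make_request_sketch(0, 24 * 1024);
        return 0;
    }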

For every block, add_stripe_bio() and handle_stripe() are called.

The intention of add_stripe_bio() is to attach the bio to a given stripe. Later, in handle_stripe(), the bio and its data can be used to serve requests.
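
A simplified sketch of what attaching a bio to a stripe means: each stripe keeps, per member disk, a list of bios waiting to read or write that block, and handle_stripe() consumes these lists later. The structures and the function below are invented stand-ins, not the kernel's stripe_head/r5dev definitions:

    /* Invented, simplified structures for illustration only. */
    struct bio_sketch {
        unsigned long      bi_sector;  /* first sector covered by the bio   */
        unsigned long      bi_size;    /* length in bytes                   */
        struct bio_sketch *bi_next;    /* link in the stripe's pending list */
    };

    struct stripe_dev_sketch {
        struct bio_sketch *toread;     /* bios waiting to read this block   */
        struct bio_sketch *towrite;    /* bios waiting to write this block  */
    };

    struct stripe_sketch {
        unsigned long            sector;  /* stripe position (Sn)           */
        struct stripe_dev_sketch dev[3];  /* one entry per member disk      */
    };

    /* queue the bio on the stripe; handle_stripe() walks these lists later */
    static void add_stripe_bio_sketch(struct stripe_sketch *sh,
                                      struct bio_sketch *bi,
                                      int dd_idx, int forwrite)
    {
        struct bio_sketch **list = forwrite ? &sh->dev[dd_idx].towrite
                                            : &sh->dev[dd_idx].toread;

        bi->bi_next = *list;  /* push onto the per-device pending list */
        *list = bi;
    }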

handle_stripe() is the core of RAID5 (discussed in the next section).

handle_stripe()

The routine works on one stripe at a time. It checks what should be done, inspects the current state of the stripe in the internal cache, decides what I/O is needed to satisfy user requests, and performs recovery.

For example, if a user wants to write block #0 (8 sectors starting from sector 0), RAID5's responsibility is to store the new data and update parity P0. There are a few possibilities here (the parity arithmetic behind the last two is sketched after the list):

  1. Delay serving until the data for block #16 is ready -- the user will probably want to write #16 very soon anyway.
  2. Read #16, compute a new parity P0, then write #0 and P0.
  3. Read P0, back the old #0 out of P0 (so the parity looks as if it had been computed without #0), then re-compute the parity with the new #0.
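
The XOR arithmetic behind options 2 and 3, written out as a sketch. For the 3-disk array above the parity relation is simply P0 = #0 ^ #16; the function names and the assumption that the old copies of the blocks are already at hand (in the stripe cache or freshly read) are mine, not the md/raid5 implementation:

    #define BLOCK_SIZE 4096   /* one PAGE_SIZE block */

    /* Option 2 ("reconstruct write"): read the other data block (#16)
     * and rebuild the parity from scratch: P0 = new #0 ^ #16. */
    static void reconstruct_write(unsigned char *p0,
                                  const unsigned char *new_d0,
                                  const unsigned char *d16)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            p0[i] = new_d0[i] ^ d16[i];
    }

    /* Option 3 ("read-modify-write"): take the old parity, XOR the old #0
     * out of it and the new #0 into it: P0 = old P0 ^ old #0 ^ new #0. */
    static void read_modify_write(unsigned char *p0,  /* old P0 in, new P0 out */
                                  const unsigned char *old_d0,
                                  const unsigned char *new_d0)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            p0[i] ^= old_d0[i] ^ new_d0[i];
    }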

The first possibility looks like the best option because it does not require an expensive read, but the problem is that the user may need to write only #0, and not #16, in the near future.

Also, the queue can get unplugged, meaning that the user wants all pending requests to complete. (Unfortunately, in the current block layer there is no way to indicate which exact request the user is interested in, so any interest in completion means serving the entire queue immediately.)

Problems

This is a short list of RAID5 problems that we encountered in the Thumper project:

* Order of handling is not good for large requests
  Because handle_stripe() goes in logical block order, it
  handles S0, then S8, and then S0 and S8 again. After the first touch,
  S0 is left with block #0 up to date, while #16 and P0 are not. Thus,
  if the stripe is forced to complete, we would need to read block
  #16 or P0 to get a fully up-to-date stripe. Such reads hurt throughput
  almost to death. If just a single process writes, things are
  OK, because nobody unplugs the queue and there are no requests that
  force completion of a pending stripe. But the more writers there are,
  the more often the queue is unplugged and pending requests are forced
  to complete. Take into account that, in reality, we use a large
  chunk size (128K, 256K, or even larger); hence, in the end, there
  are many not-fully-up-to-date stripes in the cache and many reads.
* memcpy() is a top consumer
  All requests go through the internal cache. On a dual-core,
  two-way Opteron, the copying takes up to 30-33% of the CPU
  while writing at 1 GB/s.
* Small requests
  To fill the I/O pipes and reach good throughput, we need quite large
  I/O requests. Lustre achieves this using the bio subsystem on 2.6 but,
  as described above, RAID5 handles every block separately and issues a
  separate I/O (bio) for each one. This is partially solved by the I/O
  scheduler, which merges small requests into bigger ones; however, due
  to the nature of the block subsystem, any process that wants its I/O
  to complete unplugs the queue, and we can end up with many small
  requests in the pipe.

We have developed patches that address the described problems. You can find them at ftp://ftp.clusterfs.com/pub/people/alex/raid5