RAID-Z

The original promise of RAID (Redundant Arrays of Inexpensive Disks)
was that it would provide fast, reliable storage using cheap disks.
The key point was cheap; yet somehow we ended up with expensive,
complicated RAID hardware. Why?

RAID-5 (and other data/parity schemes such as RAID-4, RAID-6, even-odd,
and Row Diagonal Parity) never quite delivered on the RAID promise — and can't —
due to a fatal flaw known as the RAID-5 write hole. Whenever you update the
data in a RAID stripe you must also update the parity, so that all disks XOR
to zero — it's that equation that allows you to reconstruct data when a
disk fails. The problem is that there's no way to update two or more disks
atomically, so RAID stripes can become damaged during a crash or power outage.
Lose power after writing a data block but before writing the corresponding
parity block, and the data and parity for that stripe are left inconsistent;
any reconstruction that later relies on that stripe returns garbage.
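
To make that "XOR to zero" equation concrete, here is a toy sketch in plain C
(nothing to do with the actual ZFS source, and the disk count and sector size
are made up): parity is just the XOR of the data columns, so any single missing
column can be rebuilt from the survivors. The flip side is that if data and
parity ever get out of sync, that same rebuild silently produces garbage, which
is exactly what the write hole does to you.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NDATA   3       /* data disks in this toy stripe */
    #define SECTOR  8       /* bytes per column, tiny for illustration */

    /* Compute parity so that data[0] ^ data[1] ^ ... ^ parity == 0. */
    static void
    compute_parity(uint8_t data[NDATA][SECTOR], uint8_t parity[SECTOR])
    {
        memset(parity, 0, SECTOR);
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < SECTOR; i++)
                parity[i] ^= data[d][i];
    }

    /* Rebuild one lost column by XORing the survivors with the parity. */
    static void
    reconstruct(uint8_t data[NDATA][SECTOR], uint8_t parity[SECTOR], int lost)
    {
        memcpy(data[lost], parity, SECTOR);
        for (int d = 0; d < NDATA; d++)
            if (d != lost)
                for (int i = 0; i < SECTOR; i++)
                    data[lost][i] ^= data[d][i];
    }

    int
    main(void)
    {
        uint8_t data[NDATA][SECTOR] = { "column0", "column1", "column2" };
        uint8_t parity[SECTOR];

        compute_parity(data, parity);

        /*
         * If a crash landed between updating a data column and updating
         * the parity, the XOR invariant would be broken and the rebuild
         * below would silently return garbage: that's the write hole.
         */
        memset(data[1], 0xff, SECTOR);                 /* "lose" disk 1 */
        reconstruct(data, parity, 1);
        printf("recovered: %s\n", (char *)data[1]);    /* prints "column1" */

        return (0);
    }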

Enter RAID-Z.

RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width.
Every block is its own RAID-Z stripe, regardless of blocksize. This means
that every RAID-Z write is a full-stripe write. This, when combined with the
copy-on-write transactional semantics of ZFS, completely eliminates the
RAID write hole.
RAID-Z is also faster than traditional RAID because it never has to do
read-modify-write.
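
If you want a feel for what "every block is its own stripe" means, here is a
rough sketch (toy C; the disk count, sector size, and column packing are my
own simplifications, not how ZFS actually lays columns out): the stripe width
is derived from the block size, parity covers exactly that block, and the
whole thing goes out as one write to freshly allocated space.

    /*
     * Toy layout of one logical block as its own single-parity stripe.
     * Only the idea matters: the stripe width comes from the block size,
     * and parity covers exactly this block, so the write is always a
     * full-stripe write.
     */
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    #define NDISKS  4       /* 1 parity column + up to 3 data columns */
    #define SECTOR  512

    struct stripe {
        size_t  ncols;                  /* data columns used by this block */
        uint8_t col[NDISKS][SECTOR];    /* col[0] holds the parity */
    };

    static void
    stripe_write(struct stripe *s, const uint8_t *buf, size_t len)
    {
        s->ncols = (len + SECTOR - 1) / SECTOR;     /* dynamic stripe width */
        assert(s->ncols <= NDISKS - 1);
        memset(s->col, 0, sizeof (s->col));

        for (size_t c = 0; c < s->ncols; c++) {
            size_t n = len - c * SECTOR;
            if (n > SECTOR)
                n = SECTOR;
            memcpy(s->col[c + 1], buf + c * SECTOR, n);
            for (size_t i = 0; i < SECTOR; i++)
                s->col[0][i] ^= s->col[c + 1][i];
        }
        /*
         * Every column, parity included, now goes out in one shot to a
         * new location (copy-on-write), so nothing on disk has to be
         * read back first and there is never a partially updated stripe.
         */
    }

    int
    main(void)
    {
        struct stripe s;
        uint8_t block[1300] = { 0 };    /* an odd-sized logical block */

        stripe_write(&s, block, sizeof (block));
        /* 1300 bytes -> 3 data columns plus parity, sized to this block */
        return (0);
    }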

Whoa, whoa, whoa — that's it? Variable stripe width? Geez, that seems
pretty obvious. If it's such a good idea, why doesn't everybody do it?

Well, the tricky bit here is RAID-Z reconstruction. Because the stripes
are all different sizes, there's no simple formula like “all the disks
XOR to zero.” You have to traverse the filesystem metadata to determine
the RAID-Z geometry. Note that this would be impossible if the filesystem
and the RAID array were separate products, which is why there's nothing
like RAID-Z in the storage market today. You really need an integrated
view of the logical and physical structure of the data to pull it off.

But wait, you say: isn't that slow? Isn't it expensive to traverse
all the metadata? Actually, it's a trade-off. If your storage pool
is very close to full, then yes, it's slower. But if it's not too
close to full, then metadata-driven reconstruction is actually faster
because it only copies live data; it doesn't waste time copying
unallocated disk space.
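
Roughly, the rebuild loop looks like the sketch below. The "pool" and block
pointer structure here are made-up stand-ins, not the real ZFS interfaces;
only the shape of the traversal matters: walk the metadata, use each live
block's size to derive its geometry, and rebuild just that block. Free space
is never visited, which is where the win on a mostly empty pool comes from.

    #include <stdio.h>
    #include <stddef.h>

    struct blkptr {
        size_t offset;      /* where this block's stripe lives */
        size_t size;        /* block size determines its RAID-Z geometry */
    };

    /* A trivial stand-in for the metadata tree: a list of live blocks. */
    static const struct blkptr live_blocks[] = {
        { .offset = 0,      .size = 4096  },
        { .offset = 8192,   .size = 512   },
        { .offset = 131072, .size = 16384 },
    };

    static void
    rebuild_column(const struct blkptr *bp, int dead_disk)
    {
        /*
         * Real code would read the surviving columns for this block,
         * verify the checksum, reconstruct the missing column, and
         * write it to the replacement disk.
         */
        printf("rebuilding %zu-byte block at offset %zu for disk %d\n",
            bp->size, bp->offset, dead_disk);
    }

    int
    main(void)
    {
        int dead_disk = 2;
        size_t n = sizeof (live_blocks) / sizeof (live_blocks[0]);

        /*
         * Walk the metadata: only allocated blocks are visited, so a
         * mostly empty pool rebuilds quickly no matter how big the
         * disks are.
         */
        for (size_t i = 0; i < n; i++)
            rebuild_column(&live_blocks[i], dead_disk);

        return (0);
    }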

But far more important, going through the metadata means that ZFS
can validate every block against its 256-bit checksum as it goes.
Traditional RAID products can't do this; they simply XOR the data
together blindly.

Which brings us to the coolest thing about RAID-Z: self-healing data.
In addition to handling whole-disk failure, RAID-Z can also detect
and correct silent data corruption. Whenever you read a RAID-Z block,
ZFS compares it against its checksum. If the data disks didn't return
the right answer, ZFS reads the parity and then does combinatorial
reconstruction to figure out which disk returned bad data. It then
repairs the damaged disk and returns good data to the application.
ZFS also reports the incident through Solaris FMA so that the system
administrator knows that one of the disks is silently failing.
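
Here is the shape of that read path in toy form (a stand-in checksum, single
parity, and made-up sizes, not the actual ZFS code): if the assembled block
fails its checksum, assume each data column in turn returned bad data, rebuild
that column from parity, and keep whichever combination checksums correctly.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NDATA   3       /* data disks, single parity, toy sizes */
    #define SECTOR  8

    /* Stand-in for the block checksum kept in the parent metadata. */
    static uint64_t
    toy_checksum(const uint8_t *buf, size_t len)
    {
        uint64_t h = 1469598103934665603ULL;    /* FNV-1a, for illustration */
        for (size_t i = 0; i < len; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;
        }
        return (h);
    }

    static void
    rebuild_from_parity(uint8_t data[NDATA][SECTOR],
        const uint8_t parity[SECTOR], int col)
    {
        memcpy(data[col], parity, SECTOR);
        for (int d = 0; d < NDATA; d++)
            if (d != col)
                for (int i = 0; i < SECTOR; i++)
                    data[col][i] ^= data[d][i];
    }

    /*
     * Read path: if the assembled block fails its checksum, assume each
     * data column in turn returned bad data, rebuild it from parity, and
     * accept whichever combination checksums correctly.  Real RAID-Z then
     * rewrites the bad column on disk and reports the event.
     */
    static bool
    read_and_heal(uint8_t data[NDATA][SECTOR], const uint8_t parity[SECTOR],
        uint64_t expected)
    {
        if (toy_checksum(&data[0][0], NDATA * SECTOR) == expected)
            return (true);                      /* clean read */

        for (int col = 0; col < NDATA; col++) {
            uint8_t trial[NDATA][SECTOR];
            memcpy(trial, data, sizeof (trial));
            rebuild_from_parity(trial, parity, col);
            if (toy_checksum(&trial[0][0], NDATA * SECTOR) == expected) {
                memcpy(data, trial, sizeof (trial));
                return (true);                  /* healed: this column was bad */
            }
        }
        return (false);                         /* unrecoverable */
    }

    int
    main(void)
    {
        uint8_t data[NDATA][SECTOR] = { "column0", "column1", "column2" };
        uint8_t parity[SECTOR] = { 0 };
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < SECTOR; i++)
                parity[i] ^= data[d][i];

        uint64_t expected = toy_checksum(&data[0][0], NDATA * SECTOR);

        data[2][3] ^= 0x40;                     /* silent corruption on disk 2 */
        printf("healed: %s\n",
            read_and_heal(data, parity, expected) ? "yes" : "no");

        return (0);
    }

With more parity the same search simply considers larger combinations of
suspect columns, which is why it's called combinatorial reconstruction.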

Finally, note that RAID-Z doesn't require any special hardware.
It doesn't need NVRAM for correctness, and it doesn't need write buffering
for good performance. With RAID-Z, ZFS makes good on the original RAID
promise: it provides fast, reliable storage using cheap, commodity disks.

The current RAID-Z algorithm is single-parity, but the RAID-Z concept
works for any RAID flavor. A double-parity version is in the works.

One last thing that fellow programmers will appreciate: the entire
RAID-Z implementation
is just 599 lines.  [Jeff Bonwick's Weblog]
