Smokin' Mirrors

Resilvering — also known as resyncing, rebuilding, or reconstructing —
is the process of repairing a damaged device using the contents of healthy devices.
This is what every volume manager or RAID array must do when one of its
disks dies, gets replaced, or suffers a transient outage.

For a mirror, resilvering can be as simple as a whole-disk copy.
For RAID-5 it's only slightly more complicated: instead of copying one
disk to another, all of the other disks in the RAID-5 stripe must be
XORed together. But the basic idea is the same.
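
To make the XOR step concrete, here's a minimal sketch in C of
reconstructing one missing block from the surviving members of a stripe.
It's illustrative only: real RAID-5 implementations also rotate parity
across disks and handle partial-stripe writes, none of which appears here.

    #include <stddef.h>

    /*
     * In RAID-5, the XOR of all blocks in a stripe (data plus parity) is
     * zero, so a missing block is simply the XOR of everything that's left.
     */
    static void
    raid5_reconstruct(unsigned char *missing, unsigned char *const surviving[],
        size_t nsurviving, size_t blocksize)
    {
        for (size_t i = 0; i < blocksize; i++) {
            unsigned char x = 0;
            for (size_t d = 0; d < nsurviving; d++)
                x ^= surviving[d][i];
            missing[i] = x;
        }
    }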

In a traditional storage system, resilvering happens either in the
volume manager or in RAID hardware. Either way, it happens well below
the filesystem.

But this is ZFS, so of course we just had to be different.

In a previous post I mentioned that
RAID-Z
resilvering requires a different approach, because it needs the
filesystem metadata to determine the RAID-Z geometry. In effect, ZFS
does a 'cp -r' of the storage pool's block tree from one disk to another.
It sounds less efficient than a straight whole-disk copy, and traversing
a live pool safely is definitely tricky (more on that in a future post).
But it turns out that there are so many advantages to metadata-driven
resilvering that we've chosen to use it even for simple mirrors.
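
To picture the 'cp -r' analogy, here's a deliberately stripped-down sketch
in C. The blkptr_t type and copy_block() helper are invented for
illustration; they are not the real ZFS interfaces, and a real traversal
has to cope with a live, changing pool.

    /* Illustrative block-pointer type, not the real ZFS structure. */
    typedef struct blkptr {
        struct blkptr *bp_child;     /* array of child block pointers */
        int           bp_nchildren;  /* zero for leaf (data) blocks */
        /* ... device addresses, checksum, birth time, ... */
    } blkptr_t;

    /* Assumed helper: read the block, verify it, write it to the new disk. */
    extern void copy_block(const blkptr_t *bp, int newdisk);

    /* Walk the pool's block tree and copy every live block to the new disk. */
    static void
    resilver(const blkptr_t *bp, int newdisk)
    {
        copy_block(bp, newdisk);

        for (int i = 0; i < bp->bp_nchildren; i++)
            resilver(&bp->bp_child[i], newdisk);
    }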

The most compelling reason is data integrity. With a simple disk copy,
there's no way to know whether the source disk is returning good data.
End-to-end data integrity requires that each data block be verified
against an independent checksum — it's not enough to know that each
block is merely consistent with itself, because that doesn't catch common
hardware and firmware bugs like misdirected reads and phantom writes.

By traversing the metadata, ZFS can use its end-to-end checksums to detect
and correct silent data corruption, just like it does during normal reads.
If a disk returns bad data transiently, ZFS will detect it and retry the read.
If it's a 3-way mirror and one of the two presumed-good disks is damaged,
ZFS will use the checksum to determine which one is correct, copy the data
to the new disk, and repair the damaged disk.
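
A rough sketch of that read-verify-repair logic, with invented helper
names (read_copy(), checksum_ok(), write_copy()) standing in for the real
ZFS code paths:

    #include <stdbool.h>

    /* Illustrative stand-ins, not the real ZFS interfaces. */
    typedef struct { unsigned char data[512]; } block_t;
    extern bool read_copy(int disk, block_t *b);         /* read one mirror copy */
    extern bool checksum_ok(const block_t *b);           /* verify against parent's checksum */
    extern void write_copy(int disk, const block_t *b);  /* rewrite a copy */

    /*
     * Resilver one block of a mirror: find a copy that matches the
     * independent checksum stored in the parent metadata, write it to
     * the new disk, and repair any surviving copy that turned out bad.
     */
    static bool
    resilver_mirror_block(const int *disks, int ndisks, int newdisk)
    {
        block_t good, b;
        bool have_good = false;

        for (int d = 0; d < ndisks && !have_good; d++) {
            if (read_copy(disks[d], &b) && checksum_ok(&b)) {
                good = b;
                have_good = true;
            }
        }
        if (!have_good)
            return false;   /* no valid copy anywhere: unrecoverable */

        write_copy(newdisk, &good);

        /* Self-healing: fix any surviving copy that failed verification. */
        for (int d = 0; d < ndisks; d++) {
            if (!read_copy(disks[d], &b) || !checksum_ok(&b))
                write_copy(disks[d], &good);
        }
        return true;
    }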

A simple whole-disk copy would bypass all of this data protection.
For this reason alone, metadata-driven resilvering would be desirable
even if it came at a significant cost in performance.

Fortunately, in most cases, it doesn't. In fact, there are several
advantages to metadata-driven resilvering:

Live blocks only.
ZFS doesn't waste time and I/O bandwidth copying free disk blocks
because they're not part of the storage pool's block tree.
If your pool is only 10-20% full, that's a big win.

Transactional pruning. If a disk suffers a transient outage,
it's not necessary to resilver the entire disk — only the parts that
have changed. I'll describe this in more detail in a future post,
but in short: ZFS uses the birth time of each block to determine
whether there's anything lower in the tree that needs resilvering.
This allows it to skip over huge branches of the tree and quickly
discover the data that has actually changed since the outage began.

What this means in practice is that if a disk has a five-second
outage, it will only take about five seconds to resilver it.
And you don't pay extra for it — in either dollars or performance —
like you do with Veritas change objects. Transactional pruning
is an intrinsic architectural capability of ZFS.
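
A hypothetical sketch of the pruning test (field names invented): because
every write in ZFS is copy-on-write, changing any block also rewrites its
ancestors, so a parent's birth time is never older than anything beneath
it. That is what makes it safe to skip a whole subtree when its root
predates the outage.

    typedef unsigned long long txg_t;   /* transaction group number */

    /* Illustrative block-pointer type with a birth time. */
    typedef struct blkptr {
        txg_t         bp_birth;         /* txg in which this block was written */
        struct blkptr *bp_child;
        int           bp_nchildren;
    } blkptr_t;

    extern void copy_block(const blkptr_t *bp, int newdisk);

    /* Resilver only what changed while the disk was out (txg >= outage_start). */
    static void
    resilver_since(const blkptr_t *bp, txg_t outage_start, int newdisk)
    {
        if (bp->bp_birth < outage_start)
            return;     /* this block, and everything below it, is unchanged */

        copy_block(bp, newdisk);

        for (int i = 0; i < bp->bp_nchildren; i++)
            resilver_since(&bp->bp_child[i], outage_start, newdisk);
    }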

Breadth-first resilvering. A storage pool is a tree of blocks.
The higher up the tree you go, the more disastrous it is to lose a
block there, because you lose access to everything beneath it.

Going through the metadata allows ZFS to do breadth-first resilvering.
That is, the very first thing ZFS resilvers is the uberblock and the
disk labels. Then it resilvers the pool-wide metadata; then each
filesystem's metadata; and so on down the tree. Throughout the process
ZFS obeys this rule: no block is resilvered until all of its ancestors
have been resilvered.

It's hard to overstate how important this is. With a whole-disk copy,
even when it's 99% done there's a good chance that one of the top 100
blocks in the tree hasn't been copied yet. This means that from an
MTTR perspective, you haven't actually made any progress: a second
disk failure at this point would still be catastrophic.

With breadth-first resilvering, every single block copied increases
the amount of discoverable data. If you had a second disk failure,
everything that had been resilvered up to that point would be available.
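
The same traversal in breadth-first form, again with invented types: a
simple queue processes the tree level by level, so a block is never copied
before its ancestors, and everything copied so far is immediately
reachable from the top of the tree.

    #include <stdlib.h>

    /* Illustrative block-pointer type, not the real ZFS structure. */
    typedef struct blkptr {
        struct blkptr *bp_child;
        int           bp_nchildren;
    } blkptr_t;

    extern void copy_block(const blkptr_t *bp, int newdisk);

    /* maxblocks must be at least the total number of blocks in the tree. */
    static void
    resilver_bfs(const blkptr_t *root, size_t maxblocks, int newdisk)
    {
        const blkptr_t **queue = malloc(maxblocks * sizeof (*queue));
        size_t head = 0, tail = 0;

        if (queue == NULL)
            return;

        queue[tail++] = root;
        while (head < tail) {
            const blkptr_t *bp = queue[head++];

            copy_block(bp, newdisk);    /* all of bp's ancestors are already done */

            for (int i = 0; i < bp->bp_nchildren; i++)
                queue[tail++] = &bp->bp_child[i];
        }
        free(queue);
    }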

Priority-based resilvering. ZFS doesn't do this one yet, but
it's in the pipeline. ZFS resilvering follows the logical structure
of the data, so it would be pretty easy to tag individual filesystems
or files with a specific resilver priority. For example, on a file
server you might want to resilver calendars first (they're important
yet very small), then /var/mail, then home directories, and so on.


What I hope to convey with each of these posts is not just the mechanics
of how a particular feature is implemented, but how all the parts of ZFS
form an integrated whole. It's not immediately obvious,
for example, that transactional semantics would have anything to do with
resilvering — yet transactional pruning makes recovery from transient
outages literally orders of magnitude faster. More on how that
works in the next post.
