Experiments with ZFS Failure Modes and Autogrowth

Before putting technology into production I like to have a very good understanding of how it handles failure.  Putting something cool into production just because it feels right is not my style; I have to prove it in my lab first.

I started with a Dell Precision T3500 in my lab, set up with SmartOS and a single 1TB Western Digital Caviar Green SATA drive.  To get off the ground I created a ZFS pool on that first drive during SmartOS's boot-time configuration.  I played around a bit with zones and VMs, then decided to run ZFS through its paces to see how it copes with unplanned additions and removals of disks.

When ZFS resilvers a drive it copies data from the other drives in the pool to repopulate it with the correct replica data, depending on your ZFS configuration.  For anyone new to SmartOS: zones is the name of the default ZFS pool, or zpool; the name is a SmartOS convention, not a ZFS-specific thing.  I installed another 1TB WD SATA drive and added it to SmartOS's default zones pool:

zpool add zones mirror c1t0d0 c1t1d0

UPDATE: when I applied this to my research servers (Dell PowerEdge R720xd's with PERC H710) I needed to run "zpool attach zones c0t0d0 c0t1d0" because simply adding a mirror did not work. YMMV.
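
A quick note on the difference, since it explains the update above: zpool attach adds a new device as a mirror of an existing device inside a vdev, while zpool add creates an entirely new top-level vdev, so attach is the canonical way to turn a single-disk pool into a mirror.  A minimal sketch using this machine's device names:

# mirror the existing single-disk vdev c1t0d0 with the new disk c1t1d0
zpool attach zones c1t0d0 c1t1d0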

Immediately ZFS began to resilver the new drive by copying data to it from the existing 1TB drive in the zones pool.  This process did not seem to take long, but I was not timing it at that point.
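
Had I wanted numbers, zpool can report them itself; the scan line of zpool status shows progress, rate, and an ETA while a resilver runs, and zpool iostat shows per-device bandwidth:

# snapshot of resilver progress (see the scan: line)
zpool status zones

# per-device bandwidth, refreshed every 5 seconds
zpool iostat -v zones 5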

This evening I shut down the machine (regrettably I don't have a hotswap bay), removed one of the 1TB drives, and replaced it with a 2TB WD Red NAS drive.  I expected that ZFS would recognize the new drive, format it, and resilver it.  In retrospect I am very glad that is not what happens, since it would mean nuking any data-bearing drive you insert.

What actually happened is told in detail by zpool status zones:

  pool: zones
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 4.01G in 0h0m with 0 errors on Fri Nov 16 03:43:09 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            12819642171874297944  UNAVAIL      0     0     0  was /dev/dsk/c1t1d0s0

errors: No known data errors

One oddity, to my eye, is that the missing device's name becomes a number (ZFS falls back to the vdev's GUID when it cannot open the device).  The really nice thing you get from Illumos here is http://illumos.org/msg/ZFS-8000-2Q which brought me directly to a step-by-step process for recovering from my "failed" drive.  I happen to know that my new 2TB drive is both good and has no data I need.  It actually just came out of its sealed anti-static bag, so I'm really sure.

Back to business.  I go back to the Illumos link and it effectively advises me to physically replace the disk (done) and then run zpool online zones c1t1d0.  Because I removed the drive without telling ZFS, I get the following message:

warning: device 'c1t1d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

To recover from this corner I ran zpool offline zones c1t1d0, which yields no output (indicating that it succeeded), and then zpool status zones to get the following:

  pool: zones
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 4.01G in 0h0m with 0 errors on Fri Nov 16 03:43:09 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            12819642171874297944  OFFLINE      0     0     0  was /dev/dsk/c1t1d0s0

errors: No known data errors

To tell ZFS that I have intentionally replaced the drive, and that this is not just a rogue disk, I run zpool replace zones c1t1d0 c1t1d0.  This seems redundant, but it makes sense: I am indicating that I replaced the hardware referenced by c1t1d0 with new hardware also referenced by c1t1d0.  Upon this replace ZFS resilvers the 2TB drive, and zpool status zones shows that it does so very quickly:

  pool: zones
 state: ONLINE
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 02:54:13 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0

errors: No known data errors

Excelsior!  We have a fully functional ZFS mirror.
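
A small aside on syntax: per the zpool man page the second device argument to zpool replace is optional, so when the new disk occupies the same slot as the old one the one-argument shorthand should do the same thing:

# shorthand: replace c1t1d0 with whatever hardware now answers at c1t1d0
zpool replace zones c1t1d0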

Had I been simulating a planned drive replacement I would have used zpool offline zones c1t1d0 prior to physically replacing the drive.  This is the correct approach if you are hotswapping drives.  I expect that failing to do this while hotswapping may cause serious problems.  If I had a hotswap bay I would definitely try it, because with hundreds of servers some operator (possibly me) at some point will accidentally pull the wrong drive in a running system.  It is worth knowing what happens.  Regrettably I don't have the gear so I can't try this yet.  There's an idea for my next expense claim.
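
For reference, here is the planned-replacement sequence I would expect to use, as a sketch; the device name assumes this machine's layout, and I have not been able to test it against a real hotswap bay:

# 1. tell ZFS the disk is about to go away
zpool offline zones c1t1d0

# 2. physically swap the drive (in a hotswap bay, or after powering down)

# 3. tell ZFS that new hardware now answers at c1t1d0 and resilver onto it
zpool replace zones c1t1d0

# 4. watch until the resilver completes
zpool status zones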

The next step in this process is to replace the other 1TB drive with my other WD 2TB Red NAS drive and see if ZFS will give me a 2TB pool as expected.

A slight detour first.  Can I restore the original array configuration with two 1TB drives?  Let's say the 2TB drives are needed elsewhere.  I replace the 2TB drive with the original 1TB and once again forget to offline the drive first, and zpool status zones yields:

  pool: zones
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 02:54:13 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            15137371295940956084  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0

First step: replace in place.  I run zpool replace zones c1t1d0 c1t1d0 to bring the disk back to life.  Here is the output of zpool status zones caught during recovery:

  pool: zones
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov 22 03:37:57 2012
    258M scanned out of 6.26G at 43.0M/s, 0h2m to go
    255M resilvered, 4.03% done
config:

        NAME                        STATE     READ WRITE CKSUM
        zones                       DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            c1t0d0                  ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0
              15137371295940956084  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0/old
              c1t1d0                ONLINE       0     0     0  (resilvering)

This time you can see the resilvering status right in the output.  I also learned that offlining the drive is unnecessary for a cold disk replacement.  It is probably still the right thing to do, and it is what I will do when I bring my 2TB drive back in.  Running zpool status zones before moving forward brings us:

  pool: zones
 state: ONLINE
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0

All is well again in ZFS land.  To mix it up I am going to replace c1t0d0 (SATA-0) instead of c1t1d0 (SATA-1) this time.  I will start by running zpool offline zones c1t0d0.  My immediate thought is that this is a solid way to nuke myself.  If I replace the wrong drive, what will happen?  It is interesting enough to try, so I am going there.

I put the 2TB drive in c1t1d0 (SATA-1), deliberately the wrong slot, and I am shocked that it boots, but it does.  Trusty zpool status zones gives us a degraded state with c1t1d0 reduced to a numeric GUID and marked FAULTED, which is much better than I expected:

  pool: zones
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            14628442290675348894  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0

So even with deliberate stupidity it seems I am safe.  This is very positive.  Now I am going in to put the disks the way they should be for my next step.  I will remove the misplaced 2TB drive from SATA-1 and restore the original 1TB drive there.  I will then remove the 1TB from SATA-0 and put the 2TB in SATA-0.  I'm feeling lucky, so I'm going to do this live with the case open and no hotswap bay.  This is likely a very stupid idea that could electrocute me, fry the motherboard or the drives, but hey, I live on the edge.

So here's the hotswap procedure as I understand it.  I will zpool offline zones c1t1d0 to prepare the erroneously placed 2TB drive for removal.  After this zpool status zones gives me:

  pool: zones
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            14628442290675348894  OFFLINE      0     0     0  was /dev/dsk/c1t1d0s0

Next, I will unplug the SATA-1 data cable (the narrower connector) and then its power cable (the wider one).  This can only be done with SATA power connectors.  Do not attempt this with Molex or you are guaranteed some serious trouble.  If this fries your computer it is on you.  Okay, zpool status zones:

  pool: zones
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            14628442290675348894  OFFLINE      0     0     0  was /dev/dsk/c1t1d0s0

As expected, the drive is still offline.  Let's zpool online zones c1t1d0 and see what happens.  Disappointment, unfortunately, is the result.  Even though it is the correct drive, take a look at zpool status zones:

  pool: zones
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 03:39:44 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            14628442290675348894  UNAVAIL      0     0     0  was /dev/dsk/c1t1d0s0

errors: No known data errors

This indicates that the drive isn't recognized as the drive it was.  This is curious.  I wonder what happens with an in-place zpool replace zones c1t1d0 c1t1d0.  It reports that the device doesn't even exist:

cannot open 'c1t1d0': no such device in /dev/dsk
must be a full path or shorthand device name

Why is that?  Let's check whether SmartOS even sees this poor man's "hot-swapped" disk.  When I run format I get the following:

Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c1t0d0 <ATA-WDC WD10EARX-32N-AB51-931.51GB>
          /pci@0,0/pci1028,293@1f,2/disk@0,0
Specify disk (enter its number): ^D
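
Only the remaining disk shows up.  On hardware and controllers that do support SATA hot-plug, illumos can usually be asked to rescan the port with cfgadm before resorting to a reboot; a sketch, where sata0/1 is a hypothetical attachment point (check the cfgadm -al listing for the real name on your system):

# list attachment points and their states
cfgadm -al

# ask the controller to (re)configure the port the swapped disk is on
cfgadm -c configure sata0/1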

So.  This T3500 apparently does not do hotswap.  That is a shame.  What happens when I reboot?  My hypothesis is that the zones pool is going to come back fine.  After the reboot the world is as it should be, as I can see from zpool status zones:

  pool: zones
 state: ONLINE
  scan: resilvered 11.6M in 0h0m with 0 errors on Thu Nov 22 05:13:10 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0

errors: No known data errors

So, stupidity followed by insanity, and ZFS keeps it afloat.  What about growing the array to 2TB like I planned?  This time I shut down first before swapping the 2TB drive back in.  On boot, zpool status zones reports the drive as faulted, as expected:

  pool: zones
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-4J
  scan: resilvered 11.6M in 0h0m with 0 errors on Thu Nov 22 05:13:10 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        zones                     DEGRADED     0     0     0
          mirror-0                DEGRADED     0     0     0
            c1t0d0                ONLINE       0     0     0
            14628442290675348894  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0

The next, obvious step, as before, is zpool replace zones c1t1d0 c1t1d0.  As before, resilvering begins, as reported by zpool status zones:

  pool: zones
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov 22 05:36:21 2012
    320M scanned out of 6.26G at 79.9M/s, 0h1m to go
    316M resilvered, 4.98% done
config:

        NAME                        STATE     READ WRITE CKSUM
        zones                       DEGRADED     0     0     0
          mirror-0                  DEGRADED     0     0     0
            c1t0d0                  ONLINE       0     0     0
            replacing-1             UNAVAIL      0     0     0
              14628442290675348894  FAULTED      0     0     0  was /dev/dsk/c1t1d0s0/old
              c1t1d0                ONLINE       0     0     0  (resilvering)

errors: No known data errors

I am amazed by how quiet both computers and drives have become.  Even with the case open this machine is surprisingly quiet.  I check zpool status zones until I see that resilvering is complete (there is probably a smarter way to do this; see the sketch after the output below):

  pool: zones
 state: ONLINE
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 05:37:57 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0

errors: No known data errors
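
As for that smarter way, a crude polling loop would do; a sketch, assuming the wording of the scan line stays stable across illumos builds:

# block until the scan line no longer reports an in-progress resilver
while zpool status zones | grep -q 'resilver in progress'; do
    sleep 10
done
echo resilver complete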

Now another power down and drive replacement should get me a 2TB pool.  Let's check the size of the pool first with zpool list zones:

NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
zones   928G  6.26G   922G         -     0%  1.00x  ONLINE  -

As expected, the pool is a little under 1TB usable even though there is a 2TB drive in there.  Once the other 2TB drive is in I expect it will be slightly under 2TB in size.  Now I power off the machine again and replace the SATA-0 1TB WD Caviar Green drive with the other 2TB WD Red NAS drive.  With SATA-0 being the swapped drive, zpool status zones now shows c1t0d0 as a numeric GUID and UNAVAIL:

  pool: zones
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 05:37:57 2012
config:

        NAME                     STATE     READ WRITE CKSUM
        zones                    DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            1897234209816080938  UNAVAIL      0     0     0  was /dev/dsk/c1t0d0s0
            c1t1d0               ONLINE       0     0     0

Just as before, but with the other drive this time, bringing the disk back is as easy as zpool replace zones c1t0d0 c1t0d0.  As with the previous replacements in the mirror, zpool status zones shows c1t0d0 resilvering under replacing-0 within mirror-0, alongside the GUID of the original drive being replaced:

  pool: zones
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov 22 05:55:55 2012
    1.40M scanned out of 6.26G at 1.40M/s, 1h16m to go
    1.04M resilvered, 0.02% done
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            replacing-0            DEGRADED     0     0     0
              1897234209816080938  FAULTED      0     0     0  was /dev/dsk/c1t0d0s0/old
              c1t0d0               ONLINE       0     0     0  (resilvering)
            c1t1d0                 ONLINE       0     0     0

errors: No known data errors

The 2TB drive in SATA-0 is resilvered in roughly a minute, as before, according to zpool status zones:

  pool: zones
 state: ONLINE
  scan: resilvered 6.26G in 0h1m with 0 errors on Thu Nov 22 05:58:55 2012
config:

        NAME        STATE     READ WRITE CKSUM
        zones       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0

How big is the zones zpool now?  I can check that with zpool list zones:

NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
zones   928G  6.26G   922G      932G     0%  1.00x  ONLINE  -

Curious why it is still only roughly 1TB rather than 2TB?  At this point ZFS is still using only the original 1TB extent of each device; the EXPANDSZ column above shows the capacity waiting to be claimed.  Running zpool online -e zones c1t0d0 tells ZFS to expand the zones pool to the full physical extent of c1t0d0.  Now zpool list zones shows the expected space of 2TB, mirrored:

NAME    SIZE  ALLOC   FREE  EXPANDSZ    CAP  DEDUP  HEALTH  ALTROOT
zones  1.81T  6.26G  1.81T         -     0%  1.00x  ONLINE  -
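
If you would rather not remember the -e step, ZFS has a pool property for this; with autoexpand enabled before the swaps, the pool should grow on its own once every device in the vdev has the larger capacity:

# opt in to automatic expansion when devices grow
zpool set autoexpand=on zones

# confirm the property took
zpool get autoexpand zones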

I was inspired to take this on, and to push for SmartOS in our QA environment, by reading Constantin Gonzalez Schmitz's excellent OpenSolaris ZFS: Mirroring vs. other RAID schemes.  Thank you Constantin, and thank you very much to Joyent for releasing the SmartOS cloud OS.  SmartOS and ZFS are fine examples of Concise Software.