Wednesday, January 17, 2018

Power?  Who Needs Power?

 

Failing To Plan Ahead

Recently I took part in a major data center operation, involving a power cutover.  The general plan was to split power for all multi-P/S devices between two feeds.  One feed would go offline, get moved to new infrastructure, come online, and then wash-rinse-repeat with the other feed.  The process would take some time, but should have been simple.  "Should have been."


Ooops?

When the appointed hour arrived for cutting 1/2 of the power, breakers were thrown, switches were switched, and most everything with multi-P/S setups remained 1/2 lit.  Except the big NetApp SAN.  If you've never yanked power from a large piece of gear, I can tell you the ka-thunk, followed by a gentle, wheezy spinning down of disks and fans, is quite sickening.  Compound that with all the hosts that suddenly had their FC and NFS storage ripped from them, and you pile headache upon headache.  And, truth be told, I'm the only monkey qualified here to fix that sort of mess.  Pushing aside the desire to kick people, I started to ponder the next move.

It's DEAD, Jim!

First things first - GIVE ME BACK MY ANGRY ELECTRONS!!!  After calmly saying I needed power back RFN, the electricians did just that.  At this point, I likened the situation to a falling knife - you never, ever try to catch a falling knife.  Move, let it fall, and pick up the pieces later.  Failure to take this advice leads to serious injury.  In this case, it would lead to serious data consistency issues instead.

The SAN died, and now it was booting.  Slowly.  I would have to wait 5-7 minutes for the nodes, shelves, and switches to come up before I could determine how bad things would get.  Only later would I be able to start considering the hosts.  I frankly told the boss, "I don't know what to do.  No one does this in a 'modern' data center - ever.  You plan at all costs to AVOID this."  The boss did not like that answer, so I did what any good sysadmin does - I went back to my desk and waited.

Signs of Life

A few days prior I had connected a serial console cable to one of the two nodes involved.  At least I wouldn't have to do that much.  I SSH'd to the terminal concentrator and poked my head into the node.  I ran the following:

set -priv diag
cluster ring show

The node I happened to be on was up.  The partner node, however, was not, and numerous RPC errors streamed by.  That is typical, and all the more likely given that the cluster interconnect switches were not yet up.

As I continued to watch, the output of cluster ring show demonstrated the cluster steadily converging back into coherence.  Ten minutes later, the nodes, switches, and shelves were all green and happy.  What saved me I can only attribute to NetApp's battery-backed NVRAM.  To speed things up, ONTAP acknowledges writes once they land in fast NVRAM, then commits them to the much slower disks later.  In the event of a catastrophe, that cache is preserved by the onboard batteries, and ONTAP replays it to put the pieces back together when the filer comes up.  Nice!
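For reference, once the ring settled, my follow-up checks looked roughly like this - a sketch assuming a reasonably recent clustered ONTAP, since syntax and output vary a bit by version:

set -privilege advanced
cluster show
storage failover show
storage aggregate show -state !online
system health alert show

cluster show and storage failover show confirm both nodes see each other and can take over for one another; the aggregate query should come back empty, and the health alerts catch any shelf or hardware complaints left over from the power hit.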


So Now What?

The biggest problems showed up on the hosts.  NFS clients mostly managed to recover gracefully, a testament to the protocol.  Most LUNs were intact, with the exception of 3 (out of 80 or so).  Two of those LUNs used ZFS, and ZFS's awesome rollback features fixed them (with manual intervention) the next day.  The last LUN held a corrupted database.  An important database.  Fortunately, that was easily remedied, a topic I'll cover in another post.
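For the curious, the manual intervention on a ZFS-backed LUN generally looks something like this - a sketch with placeholder pool, dataset, and snapshot names, not the exact commands from that night:

zpool import -F <pool>
zpool status -v <pool>
zpool scrub <pool>
zfs rollback <dataset>@<snapshot>

The -F on the import tells ZFS to discard the last few transactions and rewind the pool to a consistent state; the rollback only applies if you have a snapshot from before the outage that's worth going back to.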

The thorniest problem arrived with one of several Oracle Solaris Cluster 4.x instances.  The primary production cluster, having lost backing store AND having lost a quorum device, did what it was supposed to do - panic.  Yes folks, when OSC encounters a possible split-brain scenario, disks are fenced off from the failing node and the cluster framework intentionally panics the box.  It's the most sensible solution, since a server can easily be rebooted, but corrupt data can take days, weeks, or even months to recover.
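If you ever have to pick up after one of these panics, the place to start is confirming membership and quorum before touching any resource groups.  A rough sketch - the -Z flag applies when you're working inside a zone cluster, and the names are placeholders:

clnode status
clquorum status
clrg status -Z <zonecluster>
clrs status -Z <zonecluster>

Only once the node list and quorum votes look sane is it worth worrying about the resource groups themselves.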

Once the backing FC storage returned, the surviving node had its LUNs and its quorum device.  Oracle and SAP applications threw the requisite hissy fits, but that's an easy fix - 

clrg offline -Z <zonecluster> <RG>
clrg suspend -Z <zonecluster> <RG>
clrg online -Z <zonecluster> <RG>

Note that the 'suspend' is issued because if the RG comes online with a resource failing, a failover will commence.  In a 'pick up the pieces' situation, you don't want that.  If nothing else, it just adds time to the recovery process.
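Once the resources settle down and everything checks out, the easy step to forget is lifting that suspension (same placeholders as above):

clrg resume -Z <zonecluster> <RG>

Until that runs, the RGM won't fail the group over on its own - which is exactly what you want during cleanup, and exactly what you don't want afterward.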

By 6:30 that morning, all production systems had returned to normal, with a few non-critical items remaining on the list.  After 36 hours of awake time, I went home for a nap.  Overall, production primary went offline for about 90 minutes, production secondary for an additional 3 hours, and the other hour or so I spent doing health checks and cleanup.

Aftermath

Most things are back to normal now, with the sole remaining issue being a NetApp cluster peering problem.  It's quite novel (to me).  The peering relationship between the local SAN and a remote SAN shows up as "Unhealthy" on the local SAN, but as healthy on the remote.  Routing works, ping works, VSM replication works.  I've submitted this one to the support folks at NetApp.  I think it's because one intercluster LIF on the local SAN is not on its home port.  That's easy enough to fix, but I'd like to know before I bump the LIF.  If that's not it, the LIF bump can wait for a maintenance window while I focus on the peer issue.
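For anyone poking at something similar, the checks I'm running look roughly like this - again a sketch, assuming ONTAP 9-era syntax, with placeholder vserver and LIF names:

cluster peer show
cluster peer health show
network interface show -role intercluster -fields home-node,home-port,curr-port,is-home
network interface revert -vserver <vserver> -lif <intercluster_lif>

The revert is the "bump" - it just sends the LIF back to its home port - so that last command waits until NetApp support weighs in or a maintenance window opens up.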

Overall, for such a major screw-up, the end results haven't been too bad.  When you invest in good hardware, good software, and continuous training, you at least stack the odds in your favor.  The temptation was strong to throw up my hands and say "Screw it.  You broke it, you fix it!", especially while sleep-deprived.  But, a certain pride in seeing something through took over.  I admit, I would have much rather been curled up in bed with the wife.  But, as other admins out there know - we do what we have to do!