Wednesday, January 17, 2018

Power?  Who Needs Power?

 

Failing To Plan Ahead

Recently I took part in a major data center operation involving a power cutover.  The general plan was to split power for all multi-P/S devices between two feeds.  One feed would go offline, get moved to new infrastructure, come back online, and then wash-rinse-repeat with the other feed.  The process would take some time, but should have been simple.  "Should have been."


Ooops?

When the appointed hour arrived for cutting 1/2 of the power, breakers were thrown, switches were switched, and most everything with multi-P/S setups remained 1/2 lit.  Except the big NetApp SAN.  If you've never yanked power from a large piece of gear, I can tell you the ka-thunk, followed by a gentle, wheezy spinning down of disks and fans, is quite sickening.  Compound that with all the hosts that suddenly had their FC and NFS storage ripped from them, and you pile headache upon headache.  And, truth be told, I'm the only monkey qualified here to fix that sort of mess.  Pushing aside the desire to kick people, I started to ponder the next move.

It's DEAD, Jim!

First things first - GIVE ME BACK MY ANGRY ELECTRONS!!!  After calmly saying I needed power back RFN, the electricians did just that.  At this point, I liken the situation to a falling knife - you never, ever try to catch a falling knife.  Move, let it fall, and pick up the pieces later.  Failure to take this advice leads to serious injury.  In this case, it would lead to serious data consistency issues.

The SAN died, and now it was booting.  Slowly.  Booting.  I would have to wait 5-7 minutes for the nodes, shelves, and switches to come up before I could determine how bad things would get.  Only later would I be able to start considering the hosts.  I frankly told the boss, "I don't know what to do.  No one does this in a 'modern' data center - ever.  You plan at all costs to AVOID this."  That answer was not liked, so I did what any good sysadmin does - I went back to my desk and waited.  

Signs of Life

A few days prior I had connected a serial console cable to one of the two nodes involved.  At least I wouldn't have to do that much.  I SSH'd to the terminal concentrator and poked my head into the node.  I ran the following:

set -priv diag
cluster ring show

The node I happened to be on was up.  The partner node, however, was not, and numerous RPC errors streamed by.  That is typical, and all the more likely given that the cluster interconnect switches were not yet up.

Continuing to watch, the output of cluster ring show demonstrated continued convergence and coherence of the cluster.  Ten minutes later, the nodes, switches, and shelves were all green and happy.  What saved me I can only attribute to NetApp's battery-backed NVRAM.  In order to speed things up, NetApp adopted a design where writes commit to fast cache first and get written to the much slower disks later.  In the event of catastrophe, that cache is backed by onboard batteries.  ONTAP will try to take that cache and put the pieces back together when the filer comes up.  Nice!


So Now What?

The biggest problems showed up on the hosts.  NFS clients managed to mostly recover gracefully, a testament to the protocol.  Most LUNs were intact, with the exception of 3 (out of 80 or so).  Two LUNs used ZFS, and ZFS's awesome rollback features fixed those two (with manual intervention) the next day.  The last LUN corrupted a database.  An important database.  Fortunately, that was easily remedied, a topic I'll cover in another post.
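
For anyone curious what that ZFS recovery roughly looks like, here is a sketch; the pool, dataset, and snapshot names are hypothetical, and your import may not even need the rewind flag:

# Import the pool, discarding the last few transactions if needed to reach a consistent state
zpool import -F tank
# See what the damage looks like
zpool status -v tank
# Roll the affected dataset back to the last snapshot taken before the outage
zfs rollback -r tank/db01@hourly-2018-01-17-0200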

The thorniest problem arrived with one of several Oracle Solaris Cluster 4.x instances.  The primary production cluster, having lost backing store, AND having lost a quorum device, did what it was supposed to do - panic.  Yes folks, when OSC encounters a possible split-brain scenario, disks are fenced off from the failing node, and that failing node (by way of rgmd) intentionally panics the box.  It's the most sensible solution, since a server can easily be rebooted, but corrupt data can take days, weeks, or even months to recover.

Once the backing FC storage returned, the surviving node had its LUNs and its quorum device.  Oracle and SAP applications threw the requisite hissy fits, but that's an easy fix - 

clrg offline -Z <zonecluster> <RG>
clrg suspend -Z <zonecluster> <RG>
clrg online -Z  <zonecluster> <RG>

Note that the 'suspend' is issued because if the RG comes online with a resource failing, a failover will commence.  In a 'pick up the pieces' situation, you don't want that.  If nothing else, it just adds time to the recovery process.
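
Before and after that sequence, it is worth checking where things actually stand; a quick sketch, with the zone cluster and RG names as placeholders:

clrg status -Z <zonecluster>
clrs status -Z <zonecluster> -g <RG>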

By 6:30 that morning, all production systems had returned to normal, with a few non-critical items remaining on the list.  After 36 hours of awake time, I went home for a nap.  Overall, production primary went offline for about 90 minutes, production secondary for an additional 3 hours, and I spent the other hour or so doing health checks and cleanup.

Aftermath

Most things are back to normal now, with the sole remaining issue being a NetApp cluster peering problem.  It's quite novel (to me).  A peering relationship between the local SAN and a remote SAN shows up as "Unhealthy" on the local SAN, but as healthy on the remote.  Routing works, ping works, VSM replication works.  I've submitted this one to the support folks at NetApp.  I think it's because one intercluster LIF on the local SAN is not on its home port.  That's easy enough to fix, but I'd like to know before I bump the LIF.  If that's not it, the LIF bump can wait for a maintenance window while I focus on the peer issue.
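
For reference, something like the following shows which intercluster LIFs have wandered off their home ports, and sends one home (that revert is the "LIF bump"); a sketch, with placeholder names:

network interface show -role intercluster -fields home-node,home-port,curr-node,curr-port,is-home
network interface revert -vserver <vserver> -lif <intercluster_lif>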

Overall, for such a major screw up, the end results haven't been too bad.  When you invest in good hardware, good software, and continuous training, you at least stack the odds in your favor.  The temptation was strong to throw up my hands and say "Screw it.  You broke it, you fix it!", especially when sleep deprived.  But, a certain pride in seeing something through took over.  I admit, I would have much rather been curled up in bed with the wife.  But, as other admins out there know - we do what we have to do!

 



Friday, August 4, 2017

Simple Disaster Recovery with Netapp & ONTAP

So, The Building Fell Down

 

Planning Means Practicing To Fail

First off, no buildings were harmed in the production of this blog post.  But fires, earthquakes, floods, and security breaches will bring down your business.  When the unthinkable happens, you had better have a solid DRP (disaster recovery plan) or BCP (business continuity plan) in place.  More than that, you had better have a playbook for that plan, and you must know that it works.  All the plans in the world are potentially useless if they are flawed, untested, and unverified.

Moving A Lot of Bytes Around

A very smart person I know views the world as data.  To him, everything is data in some way, and I don't think that's an unreasonable outlook.  To the point, businesses need people and processes, but they also need their data to allow the people to interact with the processes.  If a building falls down, say after hours, you may know all the processes and you may save all the people.  What good is that to the company if there is no data to work with?  You need copies offsite, and more than just something on tertiary storage (tape).  For rapid recovery, you need hot copies of data on secondary storage (disk), ready to go at a moment's notice.  One way to do this is with ONTAP and Volume SnapMirror.

Because of the variations in how it is used, I won't go into any detailed examples.  Suffice it to say that using VSM to mirror critical data to another site is the heart of your recovery strategy as a storage/security person.  

Often this data lives in a database, and VSM'ing a hot database will leave you with either 1) an unrecoverable dataset, or 2) the need to replay transaction logs to fix the database.  The first one means you're automatically going to tape for a long and painful recovery (bad).  The second means you might or might not be going to tape, but recovery is going to take longer than anticipated.  Executives like milestones, and each DRP or BCP has them.  The sooner you reach the milestones and get running again, the sooner E-staff can take the straw out of the Pepto-Bismol bottle.


Some Like It Hot


Because VSM relies on automated snapshots, you want the process to get a good, quiesced view of the dataset.  When VSM'ing something like an Oracle 12 dataset, be sure your scripts first put the database into hot backup mode.  Your DBA can help you with this, if necessary.  Then, kick off your VSM job, place the DB back into normal or online mode, and let VSM do the rest in the background.  
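
As a rough sketch of the idea (the paths, names, and scheduling here are all hypothetical, and your DBA may prefer per-tablespace backup mode):

# Put the database into hot backup mode
sqlplus -S / as sysdba <<'EOF'
ALTER DATABASE BEGIN BACKUP;
EXIT;
EOF

# Kick off the VSM transfer from the destination cluster
ssh admin@dr-cluster snapmirror update -destination-path dr_svm:db01_dst

# Return the database to normal operation while the transfer runs in the background
sqlplus -S / as sysdba <<'EOF'
ALTER DATABASE END BACKUP;
EXIT;
EOF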

This will let you get reasonably up-to-date copies of your dataset to an offsite location, with the primary facility acting as the data source and the offsite facility as the destination.  So now you are about 1/3 of the way there; what next?


Attach of the Clones

Assume the worst has happened and now you are going to commence operating out of a backup facility.  Your data is there, your VSM's look good, what do you do?  Most would say "break the VSM relationship from the destination side, map any LUNs to the appropriate FCP (or iSCSI) initiators, present them to the server, and go."

Not so fast.  Remember that the VSM you have on the destination side may represent the very last known good copy of your data.  Breaking the VSM relationship will make those volumes read-write, meaning they may never be in that state again (assuming snapshots get deleted for any number of reasons).  In other words, that last VSM update is a point in time that you may need to preserve for legal, financial, or regulatory reasons.  Instead, clone the VSM destination to a NEW volume AND split the clone off.  In other words, make a copy of your data and work off that.  If anything goes wrong in the DR plan and you corrupt some data, you only corrupted a copy, not the master.
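
Something like the following captures the idea, with entirely made-up SVM, volume, snapshot, and igroup names: clone off the last transferred snapshot, split it, and map the copy to your hosts.

volume clone create -vserver dr_svm -flexclone db01_work -parent-volume db01_dst -parent-snapshot <last_vsm_snapshot>
volume clone split start -vserver dr_svm -flexclone db01_work
lun map -vserver dr_svm -path /vol/db01_work/lun0 -igroup dr_hosts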

Sure, the split operation will add time to the DR plan, so factor that in ahead of time.  Smart executives (a few of them do exist) will realize the benefit of taking the time to clone the data, and the folks in legal and HR will almost always insist upon it.  


So How Do I...?

I'll make this a basic hit list for your reference.

1.  Create a failover group:
     
network interface failover-groups create -failover-group dr-icl-01 -node cluster1-1 -port e0c
network interface failover-groups create -failover-group dr-icl-01 -node cluster1-1 -port e0d
 
2.  Create intercluster LIFs on each node:

network interface create -vserver cluster1-1 -lif cluster1-1_icl1 -role intercluster -home-node cluster1-1 -home-port e0c -address 10.1.1.90 -netmask 255.255.255.0 -failover-group dr-icl-01 -failover-policy nextavail

network interface create -vserver cluster1-2 -lif cluster1-2_icl1 -role intercluster -home-node cluster1-2 -home-port e0c -address 10.1.1.91 -netmask 255.255.255.0 -failover-group dr-icl-01 -failover-policy nextavail

 3.  Create the peer relationship between the production cluster and the DR cluster:

cluster1::> cluster peer create -peer-addrs <remote_ip_of_peer_intercluster_LIF> -username <UID>

cluster2::> cluster peer create -peer-addrs <remote_ip_of_peer_intercluster_LIF> -username <UID> 


4.  Create your SnapMirror relationships like you normally would on the destination:   

 snapmirror create -S <src_path> -destination-path <dest_path> -type DP -vserver <managing vserver>

 snapmirror initialize  -S <src_path> -destination-path <dest_path>

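Once the initialize finishes, a quick check from the destination confirms the relationship is healthy; the field names below are from memory, so verify them with tab completion on your release:

snapmirror show -destination-path <dest_path> -fields state,status,lag-time
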

I generally use the 'admin' user as the remote authenticating ID, though any user with the proper role configuration can be used.  Because I care about data integrity, I take this a step further with a secure cluster peer policy.  To view the existing policy:

cluster1::> cluster peer policy show
Is Unauthenticated Cluster Peer Communication Permitted:  false
                        Minimum Length for a Passphrase:  8

cluster1::> 

 
If unauthenticated communication is permitted, use cluster peer policy modify to change this to 'false'.  In step 3 you will then be prompted for a passphrase.  Use this on the destination side of the configuration to peer with the source, and then do the same on the source side to peer with the destination. 
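
Something along these lines does the trick; the exact parameter name can differ between ONTAP releases, so confirm it with tab completion before pasting:

cluster1::> cluster peer policy modify -is-unauthenticated-access-permitted false
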
All this assumes that routing, routing-groups, and name resolution are set up and in place.  This also assumes we're going from CDOT to CDOT.  For those of you that may have 7-Mode in production, and access to a remote CDOT environment, you can still pull this off.  The 'type' will be TDP (transitional data protection), and a big caveat is that you cannot reverse the SnapMirror/VSM relationship to return data from a CDOT environment to a 7-Mode environment.  You have the benefit of DR capability, but it is truly a one-way street.

Wednesday, August 2, 2017

Some Basic Redhat 7 Password Hardening

Build It Right The First Time

Today we'll look briefly at some strategies and considerations for hardening RHEL7 instances, be they physical or virtual.  A general security strategy focuses on two primary areas - the physical and the technical.  If we deploy systems with good security in the first place, we can avoid 'fire drill' exercises and reactive behavior.  Put another way, an ounce of prevention is worth a pound of cure.

Lock It Up

Most IT professionals will not have much say over many aspects of physical security.  Usually facilities staff handles card locks, access keys, power into the datacenter (though power inside the datacenter is another matter!), doors, windows, and other access controls.  This will be the focus of another article, but for the time being, keep your equipment locked up.  Closets, insecure offices, and cubicles are no place for your critical server infrastructure!

From The Top

Security can be viewed as a bottom-to-top or top-to-bottom process.  Whatever it is for you, the goal is the same - minimize risk to the business, employees, vendors, and customers.  That said, here is a brief overview of areas to consider.

A Dumb Thing To Do?

A month ago, I was installing some RHEL 7 instances as VM's - pretty routine and boring stuff - and noticed there was a way to disable shadow passwords.  I haven't seen a sane Unix or Unix-alike OS since about 1989 that 1) did not mandate the use of /etc/shadow, or 2) gave you an install-time option to circumvent it.  Maybe this has been an industry-wide option for all these years, but it certainly is not an option I would have ever used.

If you disable shadowing, your one-way password hashes will be stored in /etc/passwd, a world-readable file.  All the extra cozy, fuzzy security you get from SHA-512 hashed passwords is diminished by making those hashes available to any user on the system.   That would be a dumb thing to do, so don't do it.  

Complex is Hard, So Make It Hard

A system administrator should also enforce the use of quality passwords for user accounts.  My investigation of fresh RHEL7 installs shows that quality checks are not necessarily enabled by default.  So, scurry off and check /etc/pam.d/passwd for the following line, and if missing, add it:

password required pam_pwquality.so retry=3

 Then, in /etc/security/pwquality.conf, have the following at a minimum:

minlen = 8
minclass = 4
maxsequence = 3
maxrepeat = 2

This requires a password of at least 8 characters drawn from all 4 character classes, rejects monotonic sequences longer than 3 characters, i.e. '1234' or 'abcd', and allows at most 2 identical consecutive characters, i.e. 'll' or 'mm'.  Be aware that setting maxsequence or maxrepeat to 0 disables that particular check entirely rather than forbidding everything, so don't use 0 to mean 'none allowed'.  Normally I would set maxrepeat to 1, but 2 gives users some leeway to help them remember passwords based on, but not actually, double-consonant words.  I prefer to keep minlen at 10, but your constraints may lead to different needs.
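
If you want to sanity-check the policy without burning real login attempts, the pwscore utility from libpwquality reads a candidate password on stdin and evaluates it against pwquality.conf; the sample password below is obviously made up:

echo 'Tr0ub4dor&3' | pwscore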

Like a Fine Wine

This part is easy - make sure password aging is enabled.  Since I sometimes need to have different ages for different account types, each is handled (via scripts) on a case-by-case basis.  For the simplest aging setup, simply run:

chage -M 90 <user> 

This will force the user to pick a new password every 90 days, which is generally considered 'secure enough' in most enterprises.  Of course it would be even more secure to perform authentication from a central store like RHEL IdM or a flavor of LDAP, but that is a post for another time!
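
The "different ages for different account types" piece usually ends up as a small script.  Here is a sketch, assuming a hypothetical 'svcacct' group whose members get a longer interval:

#!/bin/bash
# Apply 180-day aging to service accounts, 90-day aging to everyone else (UID >= 1000)
while IFS=: read -r user _ uid _; do
    [ "$uid" -ge 1000 ] || continue
    if id -nG "$user" | grep -qw svcacct; then
        chage -M 180 "$user"
    else
        chage -M 90 "$user"
    fi
done < /etc/passwd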

Lock Them Out

Intruders are a persistent bunch, and sometimes they'll resort to brute force password guessing.  To intercept and deal with this behavior, make sure you configure the system to lock out user accounts after a given number of failed authentication attempts.  

Add the following lines to the auth sections of /etc/pam.d/system-auth and /etc/pam.d/password-auth:

auth required pam_faillock.so preauth silent audit deny=3 unlock_time=1200

auth [default=die] pam_faillock.so authfail audit deny=3 unlock_time=1200

and add

account required pam_faillock.so

to the account section of the above files.  
 
This will give a user 3 tries at their password.  After the third consecutive failure, the account is locked for 20 minutes.  I have found this time is long enough for a legitimate user to call in for assistance, which then provides an opportunity to verify the user's identity and perhaps have a little educational chat with them about system security.
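
Once you've verified the caller is who they say they are, clearing the lock is a one-liner (substitute the real account name):

faillock --user jdoe --reset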

To see who is locked out, the access method, and when the lockout timer started, simply run:

faillock 

I like to poll this data via cron on busy systems and generate reports to find problem users or to identify targeted users.  Knowing who the bad guy is targeting can help you take additional steps to mitigate threats going forward.   
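
A minimal sketch of that polling, as a cron.d entry; the schedule and mail recipient are placeholders, and it assumes a working local MTA and the mailx command:

# /etc/cron.d/faillock-report
0 */4 * * * root /usr/sbin/faillock | mail -s "faillock report: $(hostname)" secops@example.com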

There are many more things you can do to harden just the password subsystem of RHEL7, but I'll end this here.  This is just an introduction to some tips, strategies, configurations, and techniques you may want to consider using in your own physical or virtual RHEL 7 environment.   



Netapp Clustered Data ONTAP oddities

Roasted Spindles

Recently I've been working on an issue with poor performance on a CDOT pair of FAS3240's supporting a VMware environment and running NetApp Release 8.3.2P5.  VMware commentary aside, investigation revealed the current master node in the pair had a root volume seeing disk utilization of >80% at most hours of the day.  While not directly a security issue, this cluster has VM's dealing with ~1TB of daily log aggregation, forwarding, notifications, feeding data to Splunk, forwarding data offsite, and a host of other things.  Logs matter, and you want them handled appropriately. 

The load mix did not seem to impact the disk utilization numbers, and vol0 was your typical small installation 3-disk setup.  Wondering why the utilization was so high, I resorted to poking my head into the shell on the problematic node, and gathering some data with vmstat -s:

sadnode-01% vmstat -s
2384195110 cpu context switches
2722466877 device interrupts
3104969435 software interrupts
1867337226 traps
4090014302 system calls
     63344 kernel threads created
   3827554 fork() calls
   1088783 vfork() calls
         0 rfork() calls
    111503 swap pager pageins
    268359 swap pager pages paged in
     90417 swap pager pageouts
    271672 swap pager pages paged out
    744212 vnode pager pageins
   1955211 vnode pager pages paged in
         0 vnode pager pageouts
         0 vnode pager pages paged out
    469164 page daemon wakeups
 458176247 pages examined by the page daemon
 
Compared to another cluster, the pageins and pageouts certainly seemed excessive, as well as the work the page daemon was doing:

happynode-01% vmstat -s
2061309073 cpu context switches
3391879346 device interrupts
2611757802 software interrupts
3300814929 traps
3599776707 system calls
   343228 kernel threads created
 21972759  fork() calls
  9041120 vfork() calls
        0 rfork() calls
     2712 swap pager pageins
     9542 swap pager pages paged in
     2968 swap pager pageouts
    13830 swap pager pages paged out
    55276 vnode pager pageins
   322421 vnode pager pages paged in
        0 vnode pager pageouts
        0 vnode pager pages paged out
    17243 page daemon wakeups
458176247 pages examined by the page daemon


 Since ONTAP is a highly specialized BSD variant, and since I know a little something about Unix, I started to suspect a memory shortfall on sadnode-01, leading to excessive page scanning and paging activity, which in turn would tend to bump up utilization numbers.  In other words, a classic Unix memory shortfall issue.

However, Netapp offers no method (that I know of) for tuning the VM subsystem either from the systemshell or from ONTAP, and any modifications you might make will cause Netapp support to at least raise an eyebrow.  

Seemingly unrelated at first, I also noticed from perfstat and autosupport logs that vol0 was suffering from a moderate amount of block fragmentation.  Latency on vol0 was not excessive, but it was notably higher (>28ms) than it should have been on a typical FAS3240 root volume.  So while there may have been a memory shortfall, there was also a structural inefficiency limiting how well the regular paging mechanisms could cope with that shortfall.

By default, ONTAP runs reallocation scans on vol0, which means it attempts to optimize the layout of blocks on vol0 to maximize performance.  As a background process, ONTAP is able to do this on the fly, by automated scheduling.  Sometimes, reallocation never finishes within the allotted time, or simply gets preempted.  The solution is to run the reallocate manually, preferably during off-peak hours.  It is non-disruptive, but it does add some overhead.  On a misbehaving node, run:

cluster::> set -priv diag
cluster::*> system node run -node sadnode-01
sadnode-01> reallocate start -o -p /vol/vol0

 This will perform the reallocation, and should take care of the hot spindle problem.  In my real-world example, latency dropped to <10ms and vol0 utilization returned to a typical 5-15%, depending on cluster workload.  I still suspect there is a memory shortfall issue, and perhaps a problem in the underlying swap/paging configuration of ONTAP.  Further investigation is warranted, but for the time being, remember this if you ever run into similar issues.