No Days Off (because security never sleeps): Simple Disaster Recovery with NetApp & ONTAP

So, The Building Fell Down

Planning Means Practicing To Fail

First off, no buildings were harmed in the production of this blog post. But fires, earthquakes, floods, and security breaches can bring down your business. When the unthinkable happens, you had better have a solid DRP (disaster recovery plan) or BCP (business continuity plan) in place. More than that, you had better have a playbook for this plan, and you must know that it works. All the plans in the world are potentially useless if they are flawed, untested, or unverified.

Moving A Lot of Bytes Around

A very smart person I know views the world as data. To him, everything is data in some way, and I don't think that's an unreasonable outlook. To the point, businesses need people and processes, but they also need their data to allow the people to interact with the processes. If a building falls down, say after hours, you may know all the processes and you may save all the people. What good is that to the company if there is no data to work with? You need copies offsite, and more than just something on tertiary storage (tape). For rapid recovery, you need hot copies of data on secondary storage (disk), ready to go at a moment's notice. One way to do this is with ONTAP and Volume SnapMirror (VSM).

Because of the variations in how it is used, I won't go into any detailed examples. Suffice it to say that using VSM to mirror critical data to another site is the heart of your recovery strategy as a storage/security person.

Often this data is in a database, and VSM'ing a hot database will lead to either 1) an unrecoverable dataset, or 2) the need to replay transaction logs to fix the database. The first one means now you're automatically going to tape for a long and painful recovery (bad). The second means you might or might not be going to tape, but recovery is going to take longer than anticipated. Executives like milestones, and each DRP or BCP has them. The sooner you reach the milestones and get running again, the sooner E-staff can take the straw out of the Pepto-Bismol bottle.

Some Like It Hot

Because VSM relies on automated snapshots, you want the process to get a good, quiesced view of the dataset. When VSM'ing something like an Oracle 12 dataset, be sure your scripts first put the database into hot backup mode. Your DBA can help you with this, if necessary. Then, kick off your VSM job, place the DB back into normal or online mode, and let VSM do the rest in the background.
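To make the ordering concrete, here is a minimal Python sketch of that sequence. It is an illustration under assumptions, not a drop-in script: `run_sql` and `run_ontap` are hypothetical callables you would wire to your own SQL*Plus and ONTAP SSH tooling, and the destination path `dr_vs:oradata_dr` is made up. The two Oracle statements are the standard whole-database hot backup commands; check with your DBA whether per-tablespace backup mode fits your environment better. Also confirm on your release that the source Snapshot copy exists before the database leaves backup mode, since the update itself runs in the background.

```python
from typing import Callable, List

# Standard Oracle statements for whole-database hot backup mode.
BEGIN_BACKUP_SQL = "ALTER DATABASE BEGIN BACKUP;"
END_BACKUP_SQL = "ALTER DATABASE END BACKUP;"


def mirror_with_hot_backup(
    run_sql: Callable[[str], None],
    run_ontap: Callable[[str], None],
    destination_path: str,
) -> None:
    """Quiesce the database, kick off a SnapMirror update, then resume.

    run_sql and run_ontap are hypothetical hooks supplied by the caller.
    """
    run_sql(BEGIN_BACKUP_SQL)
    try:
        # Issued against the destination cluster, which pulls from the source.
        run_ontap(f"snapmirror update -destination-path {destination_path}")
    finally:
        # Always leave backup mode, even if the update could not be started.
        run_sql(END_BACKUP_SQL)


if __name__ == "__main__":
    # Dry run: record the steps instead of executing them.
    log: List[str] = []
    mirror_with_hot_backup(
        run_sql=lambda stmt: log.append("SQL   " + stmt),
        run_ontap=lambda cmd: log.append("ONTAP " + cmd),
        destination_path="dr_vs:oradata_dr",  # hypothetical path
    )
    print("\n".join(log))
```

The `try`/`finally` is the point of the sketch: a database stranded in hot backup mode because the mirror step failed is its own small disaster.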

This will let you get reasonably up-to-date copies of your dataset to an offsite location, with the primary facility acting as the data source and the offsite facility as the destination. So now you are about a third of the way there. What next?

Attach of the Clones

Assume the worst has happened and now you are going to commence operating out of a backup facility. Your data is there, your VSMs look good, what do you do? Most would say "break the VSM relationship from the destination side, map any LUNs to the appropriate FCP (or iSCSI) initiators, present them to the server, and go."

Not so fast. Remember that the VSM you have on the destination side may represent the very last known good copy of your data. Breaking the VSM relationship will make those volumes read-write, meaning they will never be in that state again (assume snapshots get deleted for any number of reasons). In other words, that last VSM update is a point in time that you may need to preserve for legal, financial, or regulatory reasons. Instead, clone the VSM destination to a NEW volume AND split the clone from its parent. In other words, make a copy of your data and work off that. If anything goes wrong in the DR plan and you corrupt some data, you only corrupted a copy, not the master.
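As a rough sketch of the clone-and-split step, the Python below only builds the two ONTAP command strings; it does not run anything. The vserver, volume, clone, and snapshot names are hypothetical placeholders, and I am assuming a clustered ONTAP system with FlexClone licensed. Pick the parent snapshot from what `volume snapshot show` reports on the destination volume.

```python
from typing import List


def clone_and_split_commands(
    vserver: str,
    mirror_volume: str,
    clone_name: str,
    parent_snapshot: str,
) -> List[str]:
    """Return the ONTAP commands to clone a mirror destination and split it.

    All argument values are placeholders chosen by the caller.
    """
    return [
        # Writable FlexClone backed by a snapshot on the (read-only) mirror.
        f"volume clone create -vserver {vserver} -flexclone {clone_name} "
        f"-parent-volume {mirror_volume} -parent-snapshot {parent_snapshot}",
        # Detach the clone so it no longer shares blocks with the mirror.
        f"volume clone split start -vserver {vserver} -flexclone {clone_name}",
    ]


if __name__ == "__main__":
    for cmd in clone_and_split_commands(
        vserver="dr_vs",                 # hypothetical names throughout
        mirror_volume="oradata_dr",
        clone_name="oradata_dr_work",
        parent_snapshot="<latest_snapmirror_snapshot>",
    ):
        print(cmd)
```

Map your LUNs from the clone, not from the mirror destination, and the mirrored copy stays untouched.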

Sure, the split operation will add time to the DR plan, so factor that in ahead of time. Smart executives (a few of them do exist) will realize the benefit of taking the time to clone the data, and the folks in legal and HR will almost always insist upon it.

So How Do I...?

I'll make this a basic hit list for your reference.

1. Create a failover group:

network interface failover-groups create -failover-group dr-icl-01 -node cluster1-1 -port e0c

network interface failover-groups create -failover-group dr-icl-01 -node cluster1-1 -port e0d

2. Create intercluster LIFs on each node:

network interface create -vserver cluster1-1 -lif cluster1-1_icl1 -role intercluster -home-node cluster1-1 -home-port e0c -address 10.1.1.90 -netmask 255.255.255.0 -failover-group dr-icl-01 -failover-policy nextavail

network interface create -vserver cluster1-2 -lif cluster1-2_icl1 -role intercluster -home-node cluster1-2 -home-port e0c -address 10.1.1.91 -netmask 255.255.255.0 -failover-group dr-icl-01 -failover-policy nextavail

3. Create the peer relationship between the production cluster and the DR cluster:

cluster1::> cluster peer create -peer-addrs <remote_ip_of_peer_intercluster_LIF> -username <UID>

cluster2::> cluster peer create -peer-addrs <remote_ip_of_peer_intercluster_LIF> -username <UID>

4. On the destination cluster, create your SnapMirror relationships as you normally would:

snapmirror create -S <src_path> -destination-path <dest_path> -type DP -vserver <managing vserver>

snapmirror initialize -S <src_path> -destination-path <dest_path>
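If you have more than a handful of volumes, typing step 4 by hand gets old. Here is a small Python sketch that renders the same two commands for a list of source/destination pairs. It reuses the syntax from step 4 exactly; the paths and vserver name in the example are hypothetical, and the output is meant to be reviewed before you paste it into the destination cluster.

```python
from typing import Iterable, List, Tuple


def snapmirror_commands(
    pairs: Iterable[Tuple[str, str]],
    managing_vserver: str,
) -> List[str]:
    """Render create + initialize commands for each (source, destination) pair."""
    commands: List[str] = []
    for src_path, dest_path in pairs:
        commands.append(
            f"snapmirror create -S {src_path} -destination-path {dest_path} "
            f"-type DP -vserver {managing_vserver}"
        )
        commands.append(
            f"snapmirror initialize -S {src_path} -destination-path {dest_path}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical volume pairs in <vserver>:<volume> form.
    volumes = [
        ("prod_vs:oradata", "dr_vs:oradata_dr"),
        ("prod_vs:oralogs", "dr_vs:oralogs_dr"),
    ]
    print("\n".join(snapmirror_commands(volumes, managing_vserver="dr_vs")))
```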

For the peering in step 3, I generally use the 'admin' user as the remote authenticating ID, though any user with the proper role configuration can be used. Because I care about data integrity, I take this a step further with a secure cluster peer policy. To view the existing policy:

cluster1::> cluster peer policy show
Is Unauthenticated Cluster Peer Communication Permitted: false
Minimum Length for a Passphrase: 8

cluster1::>

If unauthenticated communication is permitted, use cluster peer policy modify to change this to 'false'. In step 3 you will then be prompted for a passphrase. Use the same passphrase on the destination side of the configuration to peer with the source, and then again on the source side to peer with the destination.
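If you want to fold that policy check into a pre-flight script, something like the Python sketch below would do it. It assumes the output looks like the sample shown above (one "label: value" pair per line); the function names are mine, and the 8-character floor simply mirrors that sample rather than any recommendation from NetApp.

```python
from typing import Dict

SAMPLE_OUTPUT = """cluster1::> cluster peer policy show
Is Unauthenticated Cluster Peer Communication Permitted: false
Minimum Length for a Passphrase: 8

cluster1::>
"""

UNAUTH_KEY = "Is Unauthenticated Cluster Peer Communication Permitted"
LENGTH_KEY = "Minimum Length for a Passphrase"


def parse_peer_policy(text: str) -> Dict[str, str]:
    """Turn 'label: value' lines into a dict, skipping CLI prompt lines."""
    policy: Dict[str, str] = {}
    for line in text.splitlines():
        if "::>" in line:  # prompt line, not policy data
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            policy[key.strip()] = value.strip()
    return policy


def is_peer_policy_secure(text: str, min_length: int = 8) -> bool:
    """True if unauthenticated peering is off and passphrases are long enough."""
    policy = parse_peer_policy(text)
    length = policy.get(LENGTH_KEY, "0")
    return (
        policy.get(UNAUTH_KEY) == "false"
        and length.isdigit()
        and int(length) >= min_length
    )


if __name__ == "__main__":
    print(is_peer_policy_secure(SAMPLE_OUTPUT))
```

Run it against both clusters before peering; a policy that differs between the two sides is worth knowing about ahead of time.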

All this assumes that routing, routing-groups, and name resolution are set up and in place. This also assumes we're going from clustered Data ONTAP (CDOT) to CDOT. For those of you who may have 7-Mode in production, and access to a remote CDOT environment, you can still pull this off. The 'type' will be TDP (transitional data protection), and a big caveat is that you cannot reverse the SnapMirror/VSM relationship to return data from a CDOT environment to a 7-Mode environment. You have the benefit of DR capability, but it is truly a one-way street.

Friday, August 4, 2017
