I decided to write this blog because I’ve had some really interesting discussions about business continuity and disaster recovery over the years and I thought it might be good to share my candid view on how best to manage expectations with Executive Teams on these subjects.
The “what constitutes a disaster” argument
There is a surprisingly repetitive pattern when discussing existing DR capabilities with Boards and Execs: it nearly always starts with the assumption that everything is already protected.
In the absence of information about what protection actually exists, the most common assumption seems to be that everything is protected very well, to the point where, should a disaster occur, recovery would take less than a day; sometimes the opinion is mere hours. Unfortunately this is rarely the case, and I’ll explain why further on.
The second thing that seems to happen is that nobody can agree what a disaster is. This may sound a bit daft, but I can tell you that right now you are thinking Disaster = X while the next person to read this will be thinking Disaster = Y. This single issue is perhaps the biggest hurdle I have faced when trying to help executive teams take decisive action. It turns into a huge distraction as we have to walk through various scenarios that get steadily more unlikely:
Extended Loss of Power
Extended Loss of Communication lines
Theft or damage through attempted theft of critical infrastructure components
Hurricane or localized violent weather
Terrorist Attack or threat of terrorist attack
Nuclear Weapon Detonation/EMP
What if Zombies attack?
Obviously these last two are tongue in cheek, and I usually throw them onto the table just to bring things back into focus, but I kid you not, they have come up in conversation. The truth is that a disaster is an event that removes the use of your IT infrastructure for an extended period of time, say 48 hours or more; anything less than this can usually be weathered. Only your executive team can define it specifically for your organization.
Logic dictates that every business or organization should have at least a basic contingency in place about what happens in this eventuality. IT infrastructure is only part of this problem of course and I often use a reference story that I was once told to explain the wider implications.
The business in question had offices which were severely damaged by the Buncefield oil depot explosion some years ago. When they turned up for work after the event they had little choice but to send everyone home. It was only a week later that they realized they did not appear to have up-to-date records for all their employees within their DR plan; it was therefore an interesting task trying to reach everyone to let them know it was time to come back to work.
A little bit of planning can save a lot of hassle. A cloud based HR application today would remove this problem entirely and cost next to nothing, so things are becoming easier by default.
The next huge hurdle we get to debate is how to categorize the various applications and data from a business continuity point of view. There are obviously some applications that need to be back up and running before others so the business can get back to trading; the problem is that it’s quite difficult to get agreement from a group of department heads on exactly which ones these are. It is an obvious point of conflict and consequently sometimes ends up being moved around at the bottom of the monthly board meeting’s agenda like the proverbial Brussels sprouts on your plate at Christmas.
Keeping it simple and providing clear direction
What has become apparent over the years is that it’s far better to develop a strategy that introduces a simple base level of cover for all applications and data, and then invite this to be challenged. This is generally better received and sets up immediate discussion around specific recovery times.
The only caveat around this is email and phone systems. Real time communication is the lifeblood of the modern office, and with the increasing reliance on presence based technology and video conferencing these systems ideally need to be fully redundant. The good news is that it’s not too difficult to achieve this today. Cloud based email services are becoming standard, and hosted IP based telephony systems are easy to divert to mobiles or other offices in the event of a disaster.
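To make the “base level, then challenge” idea a little more concrete, here is a minimal Python sketch. The application names, the 24 hour base figure and the field names are all hypothetical and chosen purely for illustration; your own numbers will come out of the discussion with your department heads.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical base level of cover applied to everything
# (an assumption for this sketch, not a recommendation).
BASE_RECOVERY_HOURS = 24.0


@dataclass
class Application:
    name: str
    owner: str
    # A department head "challenges" the base level by asking for a faster recovery.
    requested_recovery_hours: Optional[float] = None


def effective_recovery_hours(app: Application, base: float = BASE_RECOVERY_HOURS) -> float:
    """Return the recovery target: the base level unless a tighter one was requested."""
    if app.requested_recovery_hours is None:
        return base
    return min(base, app.requested_recovery_hours)


def challenged(apps: List[Application], base: float = BASE_RECOVERY_HOURS) -> List[Application]:
    """List the applications whose owners asked for better than the base level."""
    return [
        a for a in apps
        if a.requested_recovery_hours is not None and a.requested_recovery_hours < base
    ]


# Illustrative, made-up portfolio.
portfolio = [
    Application("payroll", "Finance"),
    Application("order-entry", "Sales", requested_recovery_hours=4.0),
    Application("intranet-wiki", "IT"),
]

for app in challenged(portfolio):
    print(f"{app.name}: {app.owner} wants {app.requested_recovery_hours}h "
          f"instead of {BASE_RECOVERY_HOURS}h")
```

The point of the design is that only the challenged applications end up on the meeting agenda; everything else quietly inherits the base level.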
Some technology specifics
Where possible I always suggest that a client uses storage based replication for virtual machines and data instead of software tools. Storage replication is on the whole far more robust, a lot more efficient and less expensive to manage.
There are now many services available that allow storage to be replicated off-site onto a third party platform, sometimes called data protection as a service or DPaaS. This is a simple and easy way of ensuring you have a near line copy of your servers and data off-site, and you don’t have to worry about trying to recover from tape should the worst happen. Trying to avoid owning hardware is an objective that should be high on every CIO’s to-do list.
Infrastructure is not core to running the business or organisation, so it is an unneeded distraction. Get rid of as much as possible and only pay for the resource that you are consuming. Replicating to a cloud service for your DR is the ideal way to move into using this model with very low risk.
One caveat I would add here is that, depending on the sensitivity of the data, you need to ensure that your service provider has some basics in place, i.e. they operate “sovereign” data centers which guarantee to keep your data in your country and protected by your local laws, they have solid data protection processes that are audited externally on a regular basis (ISO 27001 or equivalent), and there are options for encrypting data both when at “rest” and while traveling between sites.
Another thing to bear in mind is whether there are archiving options available. These allow you to keep only the most critical data locally, removing the need to continually grow your on-site storage footprint.
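If it helps to keep provider conversations honest, the basics above can be written down as a simple checklist. The following Python sketch is only an illustration; the key names and the example answers are hypothetical, and a real assessment would obviously go deeper than yes or no.

```python
from typing import Dict, List

# The basics named above, expressed as yes/no questions for a provider.
# The key names are hypothetical labels for this sketch.
REQUIRED_BASICS = (
    "sovereign_data_centers",  # data stays in your country, under your local laws
    "externally_audited",      # e.g. ISO 27001 or equivalent
    "encryption_at_rest",
    "encryption_in_transit",
)


def missing_basics(provider_answers: Dict[str, bool]) -> List[str]:
    """Return the basics a provider has not confirmed (unanswered counts as missing)."""
    return [item for item in REQUIRED_BASICS if not provider_answers.get(item, False)]


# Made-up answers from a fictional provider.
answers = {
    "sovereign_data_centers": True,
    "externally_audited": True,
    "encryption_at_rest": True,
}

print(missing_basics(answers))
```

Treating an unanswered question as a “no” is a deliberate choice here: silence from a provider should not count in their favour.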
Looking at DR as a Service
It stands to reason that if your servers and data are all off-site, and providing your service provider has cloud computing resource as well, then you can actually purchase a full DR failover service. This base level of cover should be relatively inexpensive, offer a solid SLA of say 4 hours to recover Tier one applications, and require little to no involvement from your team in management or monitoring.
Testing also becomes far smoother and more regular. It can be done in the middle of the day using live data on the DR site instead of having to spend a weekend watching tapes restore onto bare metal servers.
Looking forward, the future is looking pretty good. As the world’s IT transforms, adopting cloud based applications that are highly redundant and delivered as a service, we see a shift away from having to worry about our on-premise data and servers because they are unlikely to exist in ten years’ time. The trick is how we handle things today with this knowledge. My personal view is that we are moving into a very interesting five year period of transformation; it’s going to be transitional, innovative and exciting, with lots of new options that don’t exist today. Those that embrace the “hybrid” concept will see the best returns and sleep more soundly at night.
Move what can be moved (email for instance) to highly resilient cloud based solutions, get them off your desk and concentrate on the SLA of your line of business applications. These too can sometimes be moved to IaaS (Infrastructure as a Service) platforms where, for a slightly lower cost, you can increase the availability, eliminate DR concerns and have on-demand performance and capacity when you need it.
For the applications and data that have to stay “in-house” (and there is going to be quite a bit of this for some types of organizations) it’s now easy to purchase prepackaged virtual infrastructure. Pre-validated (read de-risked) designs like this also offer Cloud DR/Backup options, so again, in the interest of “getting stress off of your desk”, the smart money in my opinion is going to be spent on these solutions.
So the final note I’d add is that as the underlying infrastructure that your apps and data rely on becomes a commodity, the need to worry about DR/BC will dissipate. Look for those opportunities to transform how you run your applications and you’ll never have to worry about those pesky zombies attacking.
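Regular testing is only useful if you compare the result against what was promised. As a rough Python sketch, a test run can be scored against the “say 4 hours” Tier one SLA mentioned above; the timestamps and the function name are made up for illustration, and your contract will define exactly when the clock starts and stops.

```python
from datetime import datetime, timedelta

# "Say 4 hours" for Tier one, as suggested above; treat it as a placeholder.
TIER_ONE_SLA = timedelta(hours=4)


def met_sla(failover_started: datetime, service_restored: datetime,
            sla: timedelta = TIER_ONE_SLA) -> bool:
    """True if the measured recovery time is within the SLA."""
    return (service_restored - failover_started) <= sla


# Made-up timings from an imaginary midday test.
started = datetime(2024, 1, 15, 12, 0)
restored = datetime(2024, 1, 15, 14, 45)

print("Recovery took", restored - started, "- within SLA:", met_sla(started, restored))
```

Keeping a record of these results from each test gives the executive team something factual to look at, rather than an assumption.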