As noted in the first article in this series, IT disaster recovery (DR) strategies and procedures help organisations protect their investments in IT systems and infrastructures.
The essential mission for DR is to return IT operations to an acceptable level of performance as quickly as possible following a disruptive event.
So, upon completion of a risk assessment (RA) and business impact analysis (BIA), we need to examine the critical IT services needed to support the organisation’s critical business activities.
In this article, we’ll look at how to set a disaster recovery strategy and develop detailed DR plans.
Build RPO and RTO into DR strategy
Before we look at DR strategy and planning in detail, we need to consider two vital metrics, namely recovery time objective (RTO) and recovery point objective (RPO).
According to ISO/IEC 27031:2011, the global standard for IT disaster recovery (the standard itself uses the broader term information and communication technology, or ICT), RTO is “the period of time within which minimum levels of services and/or products and the supporting systems, applications, or functions must be recovered after a disruption has occurred”.
Meanwhile, RPO is “the point in time to which data must be recovered after a disruption has occurred”. Both of these metrics are needed to define DR strategies.
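To make the two metrics concrete, the following is a minimal Python sketch of how each could be checked against a single disruptive event. The function names and all timestamps are hypothetical and chosen only for illustration; the sketch assumes that any data written after the last good backup is lost.

```python
from datetime import datetime, timedelta


def rpo_met(last_good_backup: datetime, disruption: datetime, rpo: timedelta) -> bool:
    """Data written after the last good backup is lost, so that gap must fit within the RPO."""
    return disruption - last_good_backup <= rpo


def rto_met(disruption: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """The outage, from disruption to restored service, must fit within the RTO."""
    return service_restored - disruption <= rto


# Hypothetical event: values are illustrative only.
disruption = datetime(2024, 3, 1, 14, 0)
last_backup = datetime(2024, 3, 1, 12, 30)   # 90 minutes before the disruption
restored = datetime(2024, 3, 1, 19, 0)       # 5 hours after the disruption

print(rpo_met(last_backup, disruption, rpo=timedelta(hours=1)))  # False
print(rto_met(disruption, restored, rto=timedelta(hours=8)))     # True
```

In this example the one-hour RPO is missed because 90 minutes of data is at risk, while the eight-hour RTO is met because service returns after five hours.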
RPO/RTO and the cloud
Note that these two metrics are affected by the use of cloud-based services and cyber security considerations.
For example, the RTO for an on-site datacentre can be easier to compute, as all operations are within the organisation’s own location.
By contrast, when IT operations are offloaded to cloud-based services, RTO must be provided by the cloud supplier, which may or may not be able to offer an acceptable value. The same is true when data is located in a cloud service.
On-site data storage systems make it easier to support RPO values, whereas off-site cloud-based storage providers may not be able to offer a reliable RPO. Both of these concerns make a solid service-level agreement (SLA) highly advisable, as it sets agreed performance levels the third party must support.
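One way to review an SLA is to compare the RTO and RPO values a provider commits to against the values the organisation needs for each service. The sketch below assumes both sets of values are already known; the service names, the `Objective` class and the figures are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rto: timedelta
    rpo: timedelta


def sla_gaps(required: dict[str, Objective],
             offered: dict[str, Objective]) -> dict[str, list[str]]:
    """Return, per service, which objectives the provider's SLA fails to meet."""
    gaps: dict[str, list[str]] = {}
    for service, need in required.items():
        offer = offered.get(service)
        if offer is None:
            gaps[service] = ["no SLA commitment"]
            continue
        problems = []
        if offer.rto > need.rto:   # provider recovers more slowly than required
            problems.append("RTO")
        if offer.rpo > need.rpo:   # provider may lose more data than tolerated
            problems.append("RPO")
        if problems:
            gaps[service] = problems
    return gaps


# Hypothetical requirements (from the BIA) and provider commitments (from the SLA).
required = {
    "payroll": Objective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "email": Objective(rto=timedelta(hours=8), rpo=timedelta(hours=4)),
    "order-entry": Objective(rto=timedelta(hours=2), rpo=timedelta(minutes=15)),
}
offered = {
    "payroll": Objective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "email": Objective(rto=timedelta(hours=12), rpo=timedelta(hours=4)),
}

print(sla_gaps(required, offered))
# {'email': ['RTO'], 'order-entry': ['no SLA commitment']}
```

Any service that appears in the result is a candidate for renegotiation or for a supplementary recovery arrangement.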
Strategy and detailed plans in the DR planning process
Figure 1 depicts the stages of the IT disaster recovery lifecycle and is adapted from ISO/IEC 27031:2011. The figure shows that, besides strategy development, several other activities must be considered before DR plans can be developed.
Figure 1: Stages of the IT disaster recovery lifecycle
For example, an IT disaster recovery policy is an essential part of the overall DR process. It is also one of the items auditors are most likely to examine, so it should not be overlooked.
A gap analysis, which can be performed after risk assessment and business impact analysis activities if needed, helps pinpoint areas for improvement that can enhance the overall disaster recovery planning process.
Technology performance criteria can be identified from BIAs, RAs and gap analyses, and will be factored into the DR plans. These activities can also identify the resources needed to achieve the desired performance levels. BIAs and RAs must also account for people resources, not only during a disruptive event, but also during normal operations.
Strategy definition
Once the critical systems and functions and RTOs and RPOs have been established and approved, the next step is to define strategies for responding to disruptive incidents when they occur.
ISO 27031 states: “Strategies should define the approaches to implement the required resilience so that the principles of incident prevention, detection, response, recovery and restoration are put in place.”
Strategies define “what” is to be done when responding to an incident, while plans describe “how” the response and recovery activities will be performed.
Once critical systems, data, networks, cyber security elements and cloud service firms have been identified, use the example in Table 1 as a starting point to help formulate strategies needed to protect them.
Factors to be considered when developing such a table can include budgets; management’s views regarding risks; cyber security issues; availability of resources, especially cloud services; costs versus benefits; human constraints; technological constraints; and regulatory requirements.
Key factors in DR strategy definition
The following are important issues when developing DR strategies, especially when considering the use of cloud-based services.
People considerations
Among the key issues are availability of staff and/or contractors, training needs of staff and contractors, duplication of critical skills so there can be a primary and at least one backup, available documentation to be used by staff, and follow-up to ensure staff and contractor retention of knowledge.
Use of cloud services introduces additional considerations, such as the security of data and systems, qualifications of cloud provider staff, potential for rogue cloud employees to damage or steal customer resources, willingness of cloud provider representatives to answer questions truthfully, and the ability of cloud provider staff to handle customer requirements.
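The duplication of critical skills mentioned above, with a primary and at least one backup, lends itself to a simple coverage check. The following sketch assumes a mapping from each critical skill to the people who hold it; the skills and names are hypothetical.

```python
def skills_without_backup(skill_holders: dict[str, list[str]]) -> list[str]:
    """Return skills held by fewer than two distinct people (a primary plus one backup)."""
    return sorted(
        skill for skill, people in skill_holders.items() if len(set(people)) < 2
    )


# Hypothetical skills register.
skill_holders = {
    "database restore": ["alice", "bob"],
    "network failover": ["carol"],
    "cloud console admin": [],
}

print(skills_without_backup(skill_holders))
# ['cloud console admin', 'network failover']
```

Skills listed in the output are single points of failure and point to training or contracting needs.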
Physical facilities
Here, we need to consider availability of alternate work areas within the same site, at a different company location, at a third party-provided location, at employees’ homes, and at a transportable work facility (such as a trailer kitted out for work space).
It is also important to consider site security, staff access procedures, ID badges and location of alternate space relative to the primary office site. It may not be possible to physically visit cloud provider facilities, and customer systems and data can be stored at multiple datacentres, so users must be prepared to trust cloud providers to protect their assets in secure and environmentally-safe datacentres.
Technology considerations
Considerations here include access to equipment space properly configured for systems (for example, raised floors), suitable heating, ventilation and air-conditioning (HVAC), sufficient primary electrical power, suitable voice and data infrastructure, distance of alternate technology area from primary site, provision for staffing at an alternate technology site, availability of failover (to a backup system) and failback (return to normal operations) technologies to facilitate recovery, the need to support legacy systems, and physical and information security capabilities at the alternate site.
Each of these issues must be carefully addressed when using a cloud service provider. It is advisable to include them in service-level agreements (SLAs) if possible.
Data considerations
Here we have to include timely backup of critical data to a secure storage area in accordance with RTO/RPO requirements, method(s) of data storage (for example, disk, tape, optical), connectivity and bandwidth requirements to ensure all critical data can be backed up in accordance with RTO/RPO timescales, data protection capabilities at alternate storage site, and availability of technical support from qualified third-party service providers.
These considerations are essential when using a cloud service provider, especially the provider’s resources for storing and accessing customer systems and data, how it protects its network perimeter from cyber attacks, how it accommodates customer RTO/RPO requirements, and how it tests its own DR plans.
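The connectivity and bandwidth requirement can be sanity-checked with simple arithmetic: how long does it take to move a given volume of data over a given link, and does that fit the time available? The sketch below is a rough estimate under stated assumptions: decimal units (1 GB = 8,000 megabits) and a hypothetical 70% effective link utilisation. The data volume, link speed and backup window are illustrative only.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate hours needed to copy data_gb over a link of link_mbps.

    efficiency is an assumed fraction of nominal bandwidth actually achieved.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8000                      # 1 GB = 8,000 megabits (decimal)
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


def fits_window(data_gb: float, link_mbps: float, window_hours: float,
                efficiency: float = 0.7) -> bool:
    """True if the estimated transfer completes within the available window."""
    return transfer_hours(data_gb, link_mbps, efficiency) <= window_hours


# Hypothetical nightly backup: 500 GB over a 200 Mbps link, six-hour window.
print(round(transfer_hours(500, 200), 1))        # 7.9
print(fits_window(500, 200, window_hours=6))     # False
```

In this example the backup would overrun the window, which suggests either more bandwidth, a smaller incremental data set or a longer window is needed before the RPO can be supported.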
Supplier considerations
Here we need to identify and contract with primary and alternate suppliers for all critical systems and processes, and even the sourcing of people. Key areas where alternate suppliers will be important include hardware (servers, racks), power (batteries, uninterruptible power supplies, power protection), networks (voice and data network services), repair and replacement of components, and multiple delivery firms (for example, FedEx and UPS).
Many of these issues can be mitigated by using a cloud service provider, but it is still prudent to maintain backups of critical data and applications and have supplies of critical system components.
Policies and procedures
Key steps here include defining policies for IT disaster recovery, having them approved by senior management, and defining step-by-step procedures for activities such as initiating data backup to secure alternate locations, relocating operations to an alternate space, recovering systems and data at the alternate sites, and resuming operations at either the original site or a new location. When using cloud services, be sure to factor cloud considerations into all DR policies and related procedural documents.
Finally, be sure to obtain management approval for planned strategies, policies and procedures. Be prepared to demonstrate that the proposed strategies align with the organisation’s business goals and business continuity strategies.
Translating strategies into DR plans
The next step after completing DR strategies is to translate them into disaster recovery plans and procedures. To show how this can be done, Table 1 has been revised into Table 2, which follows.
It shows critical systems and associated threats, the response strategy and (new) response action steps, the recovery strategy and (new) recovery action steps. Performing this step helps define high-level action steps that are made part of the DR plan.
Use Table 2 to expand high-level action steps into detailed step-by-step procedures, as needed. Be sure they are linked in the proper sequence.
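When action steps depend on one another, a dependency map can help confirm they are linked in the proper sequence. The following sketch uses Python’s standard `graphlib` module (Python 3.9 or later) to derive a valid order; the step names and dependencies are hypothetical and are not taken from Table 2.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be finished first.
# All step names are hypothetical examples.
depends_on = {
    "declare incident": set(),
    "notify cloud provider": {"declare incident"},
    "restore network links": {"declare incident"},
    "fail over database": {"notify cloud provider"},
    "restore application servers": {"fail over database", "restore network links"},
    "verify service": {"restore application servers"},
}

# static_order() yields every step after all of its prerequisites.
# It raises graphlib.CycleError if the steps depend on each other in a loop.
order = list(TopologicalSorter(depends_on).static_order())

for number, step in enumerate(order, start=1):
    print(number, step)
```

Where two steps do not depend on each other, more than one order is valid, so the printed sequence is one acceptable ordering rather than the only one.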
Developing DR plans
Disaster recovery plans provide a step-by-step process for responding to a disruptive event.
Procedures should ensure an easy-to-use and repeatable process for recovering damaged IT assets and returning them to normal operation as quickly as possible. If staff relocation to a third-party hot site or other alternate space is necessary, procedures must be developed for those activities. Steps for using cloud-based backup resources must be developed in coordination with the cloud provider, so that procedures are performed in the proper sequence.
Consider also reviewing the global standards ISO/IEC 24762 (Guidelines for information and communications technology disaster recovery services) and ISO/IEC 27035 (Information security incident management) when developing DR plans.
Incident response
In addition to using the strategies previously developed, IT disaster recovery plans should also include an incident response process (ISO/IEC 27035) to address the initial phases of the incident and the steps to be taken.
As in Figure 2, incident response actions should precede disaster recovery actions. When cloud services are used, work with the provider to incorporate their incident response activities into the DR plan.
Figure 2: Disaster timeline
Note: Emergency management has been included in Figure 2, as it represents activities that may be needed to address situations where people are injured or situations such as fires that must be addressed by local fire brigades and other first responders.
The DR plan structure
The following section details the framework and components for a DR plan based on ISO 27031 and ISO 24762.
Best-in-class DR plans often begin with a page or two that summarises key action steps (for example, where to assemble employees if forced to evacuate the building) and lists key contacts (for example, cloud providers, alternate work areas) with their contact information, so the plan can be authorised and launched quickly.
Introduction
Following the initial emergency pages, DR plans have an introduction that includes the purpose and scope of the plan. This section should specify who has approved the plan, who is authorised to activate it, and include a list of links to any other relevant plans and documents (for example, policies).
Roles and responsibilities
The next section should define roles and responsibilities of DR team members, their contact details, spending limits (for example, if equipment has to be purchased), and limits to their authority in a disaster situation. When cloud services are being used, these same parameters should be defined for the cloud provider.
Incident response
The incident response process identifies the sudden presence of an out-of-normal situation (for example, alerted by various system-level alarms), quickly assesses the situation (and any damage) to make an early determination of its severity, attempts to contain the incident and bring it under control, and notifies management, cloud service providers and other key stakeholders.
Plan activation
Based on findings from incident response activities, the next step is to determine if disaster recovery plans should be launched, and which ones in particular should be invoked. These activities should be carefully coordinated with cloud service providers.
If DR plans are to be invoked, incident response activities can be scaled back or terminated, depending on the incident, allowing for launch of the DR plans. Use of a cloud provider may also help scale back incident response activities, because the cloud provider should be activated early in the process.
This section defines the criteria for launching the plan, how to coordinate with the cloud provider, what data is needed and who makes the determination.
Included within this part of the plan should be assembly areas for staff (primary and alternates), procedures for notifying and activating DR team members and cloud providers, and procedures for standing down the plan if management determines the DR plan response is not needed.
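Activation criteria are easier to apply under pressure when they are written as explicit rules. The sketch below shows one possible rule set, purely as an illustration: the `Assessment` fields and the decision logic are assumptions, and the real criteria, and who applies them, are whatever the plan itself defines.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Assessment:
    """Hypothetical summary of incident response findings."""
    critical_service_affected: bool
    estimated_outage: timedelta
    contained: bool


def should_activate(assessment: Assessment, rto: timedelta) -> bool:
    """Assumed rule: activate when a critical service is affected and the
    incident is either not contained or expected to outlast the RTO."""
    if not assessment.critical_service_affected:
        return False
    if assessment.contained and assessment.estimated_outage <= rto:
        return False  # incident response alone should restore service in time
    return True


rto = timedelta(hours=4)
print(should_activate(Assessment(True, timedelta(hours=6), True), rto))    # True
print(should_activate(Assessment(True, timedelta(hours=1), True), rto))    # False
print(should_activate(Assessment(False, timedelta(hours=10), False), rto)) # False
```

A `False` result corresponds to standing down, or never launching, the DR plan while incident response continues.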
Document history
Provide a section that lists plan revisions: the date of each revision, what was revised, and who approved it. Locate this section at the front of the plan.
Procedures
Once the plan has been launched, and if cloud providers have also been notified, DR teams and cloud provider teams proceed with response and recovery activities as specified in the plans. The more detailed the plan is, the more likely the affected IT asset will be recovered and returned to normal operation.
It is essential that the cloud provider(s) know their roles during the incident. Enhance DR plans with relevant recovery information and procedures obtained from the cloud provider(s). Coordinate closely with cloud providers while developing DR plans to ensure they have documented emergency procedures.
Appendixes
Located at the end of the plan, these can include systems inventories, application inventories, network asset inventories, contracts and service-level agreements, cloud provider (and other supplier) contact data, and any additional documentation that will facilitate recovery.
Next activities
Once DR plans have been completed, they are ready to be exercised. Exercising DR plans when using a cloud service provider is particularly important, because the cloud provider will have responsibility for recovering critical systems and data. This process will determine if systems and data can be effectively recovered and returned to service as planned.
In parallel to these activities are three additional ones: creating employee awareness, employee training and records management. These are essential in that they ensure employees are fully aware of the DR plans and their responsibilities in a disaster, and DR team members and cloud service representatives have been trained in their roles and responsibilities as defined in the plans.
And since DR planning generates a significant amount of documentation, records management and change management activities should also be initiated. This is especially important when using a cloud service provider and will ensure that customers are fully aware of what the provider should be doing.
Obtain as much provider documentation as possible to keep in sync with their activities. Be sure to coordinate with company records management and change management activities during DR planning.
Summary
This article has demonstrated the importance of developing DR strategies, especially when using cloud service providers; shown how to translate them into DR plans and incident response activities; and defined the components of a DR plan and the content each contains. Fully defined DR strategies, which are based on numerous factors, especially when working with cloud providers, are essential when developing disaster recovery plans.