The Only Disaster Recovery Guide You Will Ever Need
Disaster recovery (DR) refers to the security planning area that aims to protect your organization from the negative effects of significant adverse events. It allows an organization to either maintain or quickly resume its mission-critical functions following a data disaster without incurring significant losses in business operations or revenues.
Disasters come in different shapes and sizes. They include not only catastrophic events such as earthquakes, tornadoes, or hurricanes, but also incidents such as equipment failures, cyber-attacks, or even terrorism.
In preparation, organizations and companies create DR plans detailing processes to follow and actions to take to resume their mission-critical functions.
What is Disaster Recovery?
Disaster recovery focuses on IT systems that help support an organization’s critical business functions. It is often associated with the term business continuity, but the two are not entirely interchangeable. DR is one part of business continuity, which focuses more broadly on keeping all business aspects running despite disasters.
Since IT systems have become critical to business success, disaster recovery is now a primary pillar within the business continuity process.
Most business owners do not usually consider that they may be victims of a natural disaster until an unforeseen crisis happens, which ends up costing their company a lot of money in operational and economic losses. These events can be unpredictable, and as a business owner, you cannot risk not having a disaster preparedness plan in place.
What Kind of Disasters Do Businesses Face?
Business disasters can be technological, natural, or human-made. Examples of natural disasters include floods, tornadoes, hurricanes, landslides, earthquakes, and tsunamis. Human-made and technological disasters, by contrast, include hazardous material spills, power or infrastructure failures, chemical and biological weapon threats, nuclear power plant blasts or meltdowns, cyberattacks, acts of terrorism, explosions, and civil unrest.
Potential disasters to plan for include:
- Application failure
- VM failure
- Host failure
- Rack failure
- Communication failure
- Datacenter disaster
- Building or campus disaster
- Citywide, regional, national, and multinational disasters
Why You Need DR
Regardless of size or industry, when unforeseen events take place, causing daily operations to come to a halt, your company needs to recover quickly to ensure that you continue providing your services to customers and clients.
Downtime is perhaps among the biggest IT expenses that a business faces. Based on 2014-2015 disaster recovery statistics from Infrascale, one hour of downtime can cost small businesses as much as $8,000, mid-size companies $74,000, and large organizations $700,000.
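To make these figures concrete, the following minimal sketch multiplies the hourly costs quoted above by the length of an outage. The function name and the size labels are illustrative assumptions, and the hourly values are the upper-bound Infrascale figures cited above rather than measurements for any particular business.

```python
# Hourly downtime cost figures quoted above (Infrascale, 2014-2015), in USD.
# The dictionary keys are illustrative labels, not an official classification.
HOURLY_DOWNTIME_COST = {
    "small": 8_000,
    "mid-size": 74_000,
    "large": 700_000,
}


def estimate_downtime_cost(company_size: str, hours_down: float) -> float:
    """Return a rough upper-bound cost of an outage lasting `hours_down` hours."""
    if hours_down < 0:
        raise ValueError("hours_down must be non-negative")
    return HOURLY_DOWNTIME_COST[company_size] * hours_down


if __name__ == "__main__":
    # Print the estimated cost of a four-hour outage for each size label.
    for size in HOURLY_DOWNTIME_COST:
        cost = estimate_downtime_cost(size, 4)
        print(f"{size}: ${cost:,.0f} for a 4-hour outage")
```

A real estimate would replace the table with your own revenue and labor figures; the point is only that cost scales with the duration of the outage.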
For small and mid-sized businesses (SMBs), extended loss of productivity can lead to the reduction of cash flow through lost orders, late invoicing, missed delivery dates, and increased labor costs due to extra hours resulting from downtime recovery efforts.
If you do not anticipate major disruptions to your business and address them appropriately, you risk long-term negative consequences when unexpected disasters occur.
Having a DR plan in place can save your company from multiple risks, including:
- Reputation loss
- Out-of-budget expenses
- Data loss
- Negative impact on your clients and customers
As businesses become more reliant on high availability, their tolerance for downtime has decreased. Therefore, many have a DR plan in place to prevent adverse disaster effects from affecting their daily operations.
The Essence of DR: Recovery Point and Recovery Time Objectives
The two critical measurements in DR and downtime are:
- Recovery Point Objective (RPO): It refers to the maximum age of the files that your organization must recover from backup storage for normal operations to resume after a disaster. It determines the minimum backup frequency. For instance, if your organization has a four-hour RPO, its system must back up at least every four hours.
- Recovery Time Objective (RTO): It refers to the maximum amount of time your organization requires to recover its files from backup and resume normal operations after a disaster. Therefore, RTO is the maximum downtime amount that your organization can handle. If the RTO is two hours, then your operations can’t be down for a period longer than that.
Once you identify your RPO and RTO, your administrators can use the two measures to choose optimal disaster recovery strategies, procedures, and technologies.
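As a minimal sketch of how the two objectives can be checked, the functions below compare a backup schedule against an RPO and a measured recovery time against an RTO. The function names are hypothetical, and the sample values reuse the four-hour RPO and two-hour RTO from the examples above.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Backups taken every `backup_interval` can lose at most that much data,
    so the interval must not exceed the RPO."""
    return backup_interval <= rpo


def meets_rto(measured_recovery_time: timedelta, rto: timedelta) -> bool:
    """The time taken to restore and resume operations must not exceed the RTO."""
    return measured_recovery_time <= rto


if __name__ == "__main__":
    rpo = timedelta(hours=4)  # example RPO from the text
    rto = timedelta(hours=2)  # example RTO from the text

    print(meets_rpo(timedelta(hours=4), rpo))      # backups every four hours
    print(meets_rpo(timedelta(hours=6), rpo))      # backups every six hours
    print(meets_rto(timedelta(minutes=90), rto))   # a 90-minute measured recovery
```

The recovery time passed to `meets_rto` would normally come from a DR test rather than an estimate.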
To recover operations within tighter RTO windows, your organization needs to position its secondary data optimally to make it easily and quickly accessible. One method used to restore data quickly is recovery-in-place because it moves all backup data files to a live state, which eliminates the need to move them across a network. It can protect against server and storage system failure.
Before using recovery-in-place, your organization needs to consider the following:
- Its disk backup appliance performance
- The time required to move all data from its backup state to a live one
Also, since recovery-in-place can sometimes take up to 15 minutes, replication may be necessary if you want a quicker recovery time. Replication refers to the periodic electronic refreshing or copying of a database from computer server A to server B, which ensures that all users in the network always share the same information level.
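The trade-off above can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, the transfer estimate assumes a constant throughput and decimal units (1 GB = 1,000 MB), and the 15-minute constant is the worst-case recovery-in-place figure mentioned above.

```python
from datetime import timedelta

# Worst-case recovery-in-place time mentioned in the text.
RECOVERY_IN_PLACE_WORST_CASE = timedelta(minutes=15)


def estimate_move_time(data_gb: float, throughput_mb_per_s: float) -> timedelta:
    """Estimate the time to move data from its backup state to a live one,
    assuming a constant throughput and 1 GB = 1,000 MB."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput_mb_per_s must be positive")
    return timedelta(seconds=data_gb * 1000 / throughput_mb_per_s)


def suggest_recovery_method(rto: timedelta) -> str:
    """Suggest replication when the RTO is tighter than the worst-case
    recovery-in-place time; otherwise recovery-in-place may be enough."""
    if rto < RECOVERY_IN_PLACE_WORST_CASE:
        return "replication"
    return "recovery-in-place"


if __name__ == "__main__":
    print(estimate_move_time(data_gb=500, throughput_mb_per_s=200))
    print(suggest_recovery_method(timedelta(minutes=5)))
    print(suggest_recovery_method(timedelta(hours=2)))
```

Actual restore times depend on the backup appliance and workload, so measured values from a test should replace the estimate wherever possible.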
Disaster Recovery Plan (DRP)
A disaster recovery plan refers to a structured, documented approach with instructions put in place to respond to unplanned incidents. It’s a step-by-step plan that consists of the precautions put in place to minimize a disaster’s effects so that your organization can quickly resume its mission-critical functions or continue to operate as usual.
Typically, DRP involves an in-depth analysis of all business processes and continuity needs. What’s more, before generating a detailed plan, your organization should perform a risk analysis (RA) and a business impact analysis (BIA). It should also establish its RTO and RPO.
1. Recovery Strategies
A recovery strategy should begin at the business level, which allows you to determine the most critical applications to run your organization. Recovery strategies define your organization’s plans for responding to incidents, while DRPs describe in detail how you should respond.
When determining a recovery strategy, you should consider issues such as:
- Resources available such as people and physical facilities
- Management’s position on risk
- Third-party vendors
Management must approve all recovery strategies, which should align with organizational objectives and goals. Once the recovery strategies are developed and approved, you can then translate them into DRPs.
2. Disaster Recovery Planning Steps
The DRP process involves a lot more than only writing the document. A business impact analysis (BIA) and risk analysis (RA) help determine areas to focus resources in the DRP process.
The BIA is useful in identifying the impacts of disruptive events, which makes it the starting point for risk identification within the DR context. It also helps generate the RTO and RPO.
The risk analysis identifies vulnerabilities and threats that could disrupt the normal operations of processes and systems highlighted in the BIA. The RA also assesses the likelihood of the occurrence of a disruptive event and helps outline its potential severity.
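One common way to combine likelihood and severity is to multiply them into a single score and rank threats by it. The sketch below assumes that convention; the 1-to-5 scales, the class and function names, and the sample ratings are all hypothetical and are not prescribed by any particular RA methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    severity: int    # assumed scale: 1 (minor) to 5 (catastrophic)

    @property
    def score(self) -> int:
        """Simple likelihood-times-severity score (an assumed convention)."""
        return self.likelihood * self.severity


def rank_risks(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


if __name__ == "__main__":
    # Hypothetical ratings, for illustration only.
    risks = [
        Risk("Host failure", likelihood=4, severity=2),
        Risk("Datacenter disaster", likelihood=1, severity=5),
        Risk("Cyberattack", likelihood=3, severity=4),
    ]
    for risk in rank_risks(risks):
        print(f"{risk.name}: {risk.score}")
```

The ranked list indicates where the DRP should focus resources first; the ratings themselves would come from your own BIA and RA.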
A DR plan checklist has the following steps:
- Establishing the activity scope
- Gathering the relevant network infrastructure documents
- Identifying severe threats and vulnerabilities as well as the organization’s critical assets
- Reviewing the organization’s history of unplanned incidents and their handling
- Identifying the current DR strategies
- Identifying the emergency response team
- Having the management review and approve the DRP
- Testing the plan
- Updating the plan
- Implementing a DR plan audit
3. Creating a DRP
An organization can start its DRP with a summary of all the vital action steps required and a list of essential contacts, which ensures that crucial information is easily and quickly accessible.
The plan should also define the roles and responsibilities of team members while also outlining the criteria to launch the action plan. It must then specify, in detail, the response and recovery activities. The other essential elements of a DRP template include:
- Statement of intent
- The DR policy statement
- Plan goals
- Authentication tools such as passwords
- Geographical risks and factors
- Tips for dealing with the media
- Legal and financial information
- Plan history
4. DRP Scope and Objectives
A DRP can range in scope (i.e., from basic to comprehensive). Some can be upward of 100 pages.
DR budgets can vary significantly and fluctuate over time. Therefore, your organization can take advantage of any free resources available such as online DR plan templates from the Federal Emergency Management Agency. There is also a lot of free information and how-to articles online.
A DRP checklist of goals includes:
- Identifying critical IT networks and systems
- Prioritizing the RTO
- Outlining the steps required to restart, reconfigure, or recover systems and networks
The plan should, at the very least, minimize any adverse effects on daily business operations. Your employees should also know the necessary emergency steps to follow in the event of unforeseen incidents.
Distance, though important, is often overlooked during the DRP process. A DR site located close to the primary data center is ideal in terms of convenience, cost, testing, and bandwidth. However, since outages differ in scope, a severe regional event may destroy both the primary data center and its DR site when the two are located close together.
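The distance consideration can be checked with a short calculation. The sketch below uses the haversine formula to estimate the great-circle distance between two sites and compares it with a minimum separation. The function names, the sample coordinates, and the separation threshold are hypothetical; the right threshold depends on the regional hazards you are planning for.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in degrees (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sites_far_enough_apart(primary: tuple[float, float],
                           dr_site: tuple[float, float],
                           min_separation_km: float) -> bool:
    """True when the DR site is at least `min_separation_km` from the primary."""
    return distance_km(*primary, *dr_site) >= min_separation_km


if __name__ == "__main__":
    # Hypothetical (latitude, longitude) pairs and threshold.
    primary = (40.71, -74.01)
    dr_site = (41.88, -87.63)
    print(round(distance_km(*primary, *dr_site)), "km apart")
    print(sites_far_enough_apart(primary, dr_site, min_separation_km=300))
```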
5. Types of Disaster Recovery Plans
You can tailor a DRP for a given environment.
- Virtualized DRP: Virtualization allows you to implement DR in an efficient and straightforward way. Using a virtualized environment, you can create new virtual machine (VM) instances immediately and provide high-availability application recovery. What’s more, it makes testing easier to achieve. Your plan must include the ability to validate that applications can run in DR mode and return to normal operations within the RTO and RPO.
- Network DRP: Coming up with a plan to recover a network gets more complicated as network complexity increases. Therefore, it is essential to detail the recovery procedure step by step, test it correctly, and keep it updated. The data in a network DRP is specific to the network, such as its performance and its networking staff.
- Cloud DRP: A cloud-based DR can range from file backup to a complete replication process. Cloud DRP is time-, space-, and cost-efficient; however, maintaining it requires skill and proper management. Your IT manager must know the location of both the physical and virtual servers. Also, the plan must address security issues related to the cloud.
- Data Center DRP: This plan focuses on your data center facility and its infrastructure. One key element in this DRP is an operational risk assessment since it analyzes the key components required, such as building location, security, office space, and power systems and protection. It must also address a broader range of possible scenarios.
Disaster Recovery Testing
Testing substantiates all DRPs. It identifies deficiencies in the plan and provides opportunities to fix any problems before a disaster occurs. Testing can also offer proof that the plan is effective and meets its RPOs.
IT technologies and systems are continually changing. Therefore, testing ensures that your DRP is up to date.
Some reasons for not testing DRPs include budget restrictions, lack of management approval, or resource constraints. DR testing also takes time, planning, and resources. It can also be an incident risk if it involves the use of live data. However, testing is an essential part of DR planning that you should never ignore.
DR testing ranges from simple to complex:
- A plan review involves a detailed discussion of the DRP and looks for any missing elements and inconsistencies.
- A tabletop test sees participants walk through the plan’s activities step by step. It demonstrates whether DR team members know their duties during an emergency.
- A simulation test is a full-scale test that uses resources such as backup systems and recovery sites without an actual failover.
- Running in disaster mode for a period is another method of testing your systems. For instance, you could fail over to your recovery site and let your systems run from there for a week before failing back.
Your organization should schedule testing in its DR policy; however, be wary of its intrusiveness. Testing too frequently is counterproductive and drains your personnel, while testing too rarely is also risky. Additionally, always test your DR plan after making any significant system changes.
To get the most out of testing:
- Secure management approval and funding
- Provide detailed test information to all parties concerned
- Ensure that the test team is available during the test date
- Schedule your test correctly to ensure that it doesn’t conflict with other activities or tests
- Confirm that test scripts are correct
- Verify that your test environment is ready
- Schedule a dry run first
- Be prepared to stop the test if needed
- Have a scribe take notes
- Complete an after-action report detailing what worked and what failed
- Use the results gathered to update your DR plan
Disaster Recovery-as-a-Service (DRaaS)
Disaster recovery-as-a-service is a cloud-based DR method that has gained popularity over the years. This is because DRaaS lowers costs, is easier to deploy, and allows regular testing.
Cloud-based options save your company money because they run on shared infrastructure. They are also quite flexible, allowing you to sign up for only the services you need, and you can complete your DR tests by only spinning up temporary instances.
DRaaS expectations and requirements are documented and contained in a service-level agreement (SLA). The third-party vendor then provides failover to their cloud computing environment, either on a pay-per-use basis or through a contract.
However, cloud-based DR may not be available after large-scale disasters since the DR site may not have enough room to run every user’s applications. Also, since cloud DR increases bandwidth needs, the addition of complex systems could degrade the entire network’s performance.
Perhaps the biggest disadvantage of cloud DR is that you have little control over the process; thus, you must trust your service provider to implement the DRP in the event of an incident while meeting the defined recovery point and recovery time objectives.
Costs vary widely among vendors and can add up quickly if the vendor charges based on storage consumption or network bandwidth. Therefore, before selecting a provider, you need to conduct a thorough internal assessment to determine your DR needs.
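As part of that assessment, a simple model can show how usage-based charges accumulate. The sketch below is an assumption-laden illustration: the function name, the pricing structure (a flat fee plus per-GB storage and per-GB transfer charges), and the sample rates are hypothetical and do not reflect any vendor’s actual pricing.

```python
def estimate_monthly_draas_cost(storage_gb: float,
                                price_per_stored_gb: float,
                                transfer_gb: float,
                                price_per_transferred_gb: float,
                                base_fee: float = 0.0) -> float:
    """Estimate a monthly bill under an assumed pricing model:
    a flat base fee plus charges for stored and transferred data."""
    values = (storage_gb, price_per_stored_gb, transfer_gb,
              price_per_transferred_gb, base_fee)
    if min(values) < 0:
        raise ValueError("all inputs must be non-negative")
    return (base_fee
            + storage_gb * price_per_stored_gb
            + transfer_gb * price_per_transferred_gb)


if __name__ == "__main__":
    # Placeholder rates for illustration only; substitute quotes from vendors.
    cost = estimate_monthly_draas_cost(
        storage_gb=1000,
        price_per_stored_gb=0.05,
        transfer_gb=200,
        price_per_transferred_gb=0.10,
        base_fee=100.0,
    )
    print(f"Estimated monthly cost: ${cost:,.2f}")
```

Running the same function with each vendor’s quoted rates and your projected data volumes gives a like-for-like comparison before you commit to a contract.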
Some questions to ask a potential provider include:
- How will your DRaaS work based on our existing infrastructure?
- How will it integrate with our existing DR and backup platforms?
- How do users access internal applications?
- What happens if you cannot provide a DR service we need?
- How long can we run in your data center after a disaster?
- What are your failback procedures?
- What is your testing process?
- Do you support scalability?
- How do you charge for your DR service?
Disaster Recovery Sites
A DR site allows you to recover and restore your technology infrastructure and operations when your primary data center is unavailable. These sites can be internal or external.
As an organization, you are responsible for setting up and maintaining an internal DR site. These sites are necessary for companies with aggressive RTOs and large information requirements. Some considerations to make when building your internal recovery site are hardware configuration, power maintenance, support equipment, layout design, heating and cooling, location, and staff.
Though much more expensive compared to an external site, an internal DR site allows you to control all aspects of the DR process.
External sites are owned and operated by third-party vendors. They can either be:
- Hot: It’s a fully functional data center complete with hardware and software, round-the-clock staff, as well as personnel and customer data.
- Warm: It’s an equipped data center with no customer data. Clients can install additional equipment or introduce customer data.
- Cold: It has the infrastructure in place to support data and IT systems. However, it has no technology until client organizations activate DR plans and install equipment. It sometimes supplements warm and hot sites during long-term disasters.
Disaster Recovery Tiers
During the 1980s, two entities, the SHARE Technical Steering Committee and International Business Machines (IBM), came up with a tier system for describing DR service levels. The system showed off-site recoverability, with tier 0 representing the least amount and tier 6 the most.
A seventh tier was later added to include DR automation. Today, it represents the highest availability level in DR scenarios. Generally, as the ability to recover improves with each tier, so does the cost.
The Bottom Line
Preparing for a disaster is not easy. It requires a comprehensive approach that encompasses software, hardware, networking equipment, connectivity, power, and testing that ensures disaster recovery is achievable within RPO and RTO targets. Although implementing a thorough and actionable DR plan takes effort, its potential benefits are significant.
Everyone in your company must be aware of any disaster recovery plan put in place, and during implementation, effective communication is essential. It is imperative that you not only develop a DR plan but also test it, train your personnel, document everything correctly, and improve it regularly. Finally, be careful when hiring the services of any third-party vendor.
Need an enterprise-level disaster recovery plan for your organization? Veritas can help. Contact us now to receive a call from one of our representatives.
The Veritas portfolio provides all the tools you need for a resilient enterprise. From daily micro disasters to a “black swan” event, Veritas covers at scale. Learn more about Veritas Resiliency Platform, and download a free trial today.