Accidents may be caused by a natural disaster, force majeure, or human interference. What they all have in common is that they damage the stable operation of the company, causing loss of data, working time, and physical ability to perform the role.
Even if the catastrophe itself cannot be avoided or prevented, its consequences can be contained and eliminated quickly. To do this quickly, efficiently, and reliably, it is imperative to create a disaster recovery checklist in advance. In most companies, this cannot be done without working with big data.
It can affect everyone
Any plan is a deliberately constructed sequence of decisions and actions to achieve a desired result in an allotted time.
Poor planning results in either failure or an unpredictably extended delivery time. This is less critical when it comes to minor goals, but if the goal is important, the plan cannot be neglected.
- Is it important to prepare for a crisis?
- Is it important to act effectively at this moment?
- Is it important to quickly restore work after the crisis?
If the answer to any of these is “Yes”, then a disaster recovery plan is a must.
A disaster recovery plan checklist is also an algorithm
Disaster recovery is a sequence of actions that are aimed at identifying, preventing, and eliminating the consequences of technical failures.
The information systems of any private individual or commercial company can fail suddenly and this failure can adversely affect normal operations. Blackouts, natural disasters, military conflicts — these troubles occur, and the disaster recovery plan documents:
- The company's strategic policy aimed at anticipating and reacting to the impacts of these events
- The decisions and actions to be taken in the process of eliminating the impact
- The prompt elimination of the outcomes.
Stopping work benefits no one
Whatever kind of emergency strikes, the result is disruption to or a shutdown of the organization. This leads to both direct monetary losses and reputational losses. A disaster recovery plan mitigates these impacts by resolving issues as fast as possible.
- Operational Disruption — The shutdown of the company for emergency reasons causes damage to all employees. Efficient disaster recovery reduces downtime and returns the workflow to its former efficiency.
- Client retention — Disruption almost always affects customers. Business card printing sites and inter-bank settlement applications are products of entirely different scales but the owners and customers of both will be equally upset. The disaster recovery plan provides methods for supporting customers during a business outage.
- "Financial cushion" — A good disaster recovery plan minimizes the cost to return to normal operation. Financial losses will remain but there is no need to increase them through poor or inadequate planning.
The development and implementation of disaster recovery measures strengthen the company's virtual security system. It is particularly in the field of big data protection that qualified assistance may be needed.
Mandatory points of disaster recovery
All IT companies are different, so their disaster recovery plans will be different too. However, two points are universal when it comes to system backup.
- Recovery time: The period within which the production system must fully return to the level at which it was operating before the crisis (often called the recovery time objective, RTO). This time can be either negligible (a few seconds of failure) or quite large (up to several days if a move is necessary). Restoring from a copy must take no longer than this value, and the same applies to executing the disaster recovery plan as a whole.
- Restore point: The maximum period for which data can be lost (often called the recovery point objective, RPO). For example, if copies are made every two hours, a failure can lose up to the last two hours of work. That may be easy to recreate if you're not in a hurry, but in the banking and financial sectors two seconds is enough to cause monetary and reputational losses. The lower this value, the more often you need to copy the data.
Choosing appropriate values for these two indices is key. The choice is essentially financial and takes into account the size of the loss caused by a failure. Low values drive up the cost of the recovery plan. If the plan costs more than the loss it prevents, then you need to increase the two values.
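The trade-off can be made concrete with a small calculation. The sketch below is illustrative only: the class, the function names, and every figure in it (hourly loss, plan cost, hours saved) are assumptions, not values from this article. It checks whether a backup schedule fits the chosen restore point and recovery time, and compares the yearly cost of the plan with the loss it is expected to prevent.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    """Hypothetical container for the two indices discussed above."""
    recovery_time_hours: float   # maximum acceptable time to restore service
    restore_point_hours: float   # maximum acceptable window of lost data


def schedule_meets_targets(targets, backup_interval_hours, restore_duration_hours):
    """Return True if the backup schedule satisfies both targets.

    In the worst case a failure happens just before the next backup,
    so the data lost equals the full backup interval.
    """
    return (backup_interval_hours <= targets.restore_point_hours
            and restore_duration_hours <= targets.recovery_time_hours)


def plan_is_worth_it(plan_cost_per_year, loss_per_hour, hours_saved_per_year):
    """Compare the cost of the plan with the loss it is expected to prevent."""
    prevented_loss = loss_per_hour * hours_saved_per_year
    return plan_cost_per_year <= prevented_loss


# Made-up example values.
targets = RecoveryTargets(recovery_time_hours=4, restore_point_hours=1)
print(schedule_meets_targets(targets, backup_interval_hours=2, restore_duration_hours=3))
print(plan_is_worth_it(plan_cost_per_year=20_000, loss_per_hour=5_000, hours_saved_per_year=10))
```

When the second check fails, relaxing the targets (a longer recovery time or a larger restore point) is what brings the cost of the plan down.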
The disease is easier to prevent than to cure
We may not know in advance what failure may occur, but we must always be ready for it. Constant preparation, like an athlete's workout, "pumps up the muscles" of the security system.
Before any disaster, we need (and should):
- Summarize the disaster recovery plan and its main goals
- Assemble the contacts of the emergency restoration team and other responsible employees
- Know and practice the procedure for the team immediately after the emergency
- Design and test the hardware & software restoration plan (algorithm)
- Assemble and document the software licensing information, support desk contacts, support contracts & credentials for all elements of the architecture (close any gaps you find – negotiation is easier when you don’t have a real-time crisis!)
- Understand the key communication channels and the technical network
- Assemble and document the contact details of the companies that provide backup services, as well as of the office property managers
When it's written in the disaster recovery plan, it won't be forgotten. There is nothing worse than trying to locate key contact or licensing information when it’s time critical. Do this when there is no time pressure.
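Finding the gaps in this documentation can be partly automated. The following is a minimal sketch, assuming the information is kept as a simple inventory keyed by architecture element; the field names and the sample entries are hypothetical and chosen only for illustration.

```python
# Hypothetical list of what should be on file for every architecture element.
REQUIRED_FIELDS = ("license", "support_contact", "support_contract", "credentials_location")


def find_gaps(inventory):
    """Return {component: [missing fields]} for every incomplete entry."""
    gaps = {}
    for component, details in inventory.items():
        # A field counts as missing if it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not details.get(field)]
        if missing:
            gaps[component] = missing
    return gaps


# Example inventory; all names and values are made up.
inventory = {
    "database": {
        "license": "LIC-001",
        "support_contact": "dba-support@example.com",
        "support_contract": "contract-17",
        "credentials_location": "password vault, entry 'db-admin'",
    },
    "mail server": {
        "license": "LIC-002",
        "support_contact": "",
    },
}

for component, missing in find_gaps(inventory).items():
    print(f"{component}: missing {', '.join(missing)}")
```

Running such a check on a schedule, while there is no time pressure, turns the list above into something that can be reviewed rather than remembered.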
Fear has big eyes
No matter how difficult your situation may be, you should always have three reserves:
- Infrastructure — moving to another location
- Power grids — purchase of fuel & generators
- Internet — another provider or own station
Small businesses need to think about disaster recovery too, even if the plan is only a sketch in a notebook.
A chain is no stronger than its weakest link
A key activity when starting to prepare a disaster recovery plan is an honest assessment of the main vulnerabilities of the production system.
In the range of disasters from "hard down" to "everyone will scatter" there may be numerous weak points. It is important not to lose sight of these when thinking about the disasters which may occur. We don't know where trouble may strike, so when creating a disaster recovery plan, we need to focus on restoring functions rather than sorting through individual cases.
If you don't know where to start, google "Disaster recovery plan templates", find one you like, and change the parameters to suit your situation. Now test it. Create a crisis mentally and work through the plan you have created.
Whichever template you choose, the plan should contain:
- The composition of the disaster recovery team and the level of access of its members
- A risk and vulnerability assessment
- A recovery budget
- An assessed need for contractors
- The technologies that will be used (most likely this list will include big data)
- An agreement with the service provider about the resilience level
- The recovery methods and reserves
- The format of the disaster recovery plan document
The result will eventually need to be coordinated in all departments of the company.
Disaster recovery is impossible without a team. Having decided on the positions, either appoint internal staff or hire from outside. Ensure everyone is clear on their role and how to react in an emergency.
Don't forget to save
Since the level of threat can vary, the disaster recovery plan should contain an approach to backing up working information. It is desirable to replicate to environments and locations that are unlikely to suffer from the same disaster. Also, the impact analysis should provide a robust mechanism for fast and high-quality data recovery and, ideally, for transferring processing to a safe location, producing "air gaps".
If you compare a disaster recovery plan to a car's safety system, then backups are the airbags. They help a lot, but only get deployed at the last moment.
The backup process and response procedures can be carried out in parallel with disaster recovery activities.
Evaluated against the key parameters, backups look like this:
- Restore point: Due to technical limitations, backups may be made less frequently than the disaster recovery plan requires, so data created between the last backup and the point of crisis may be lost
- Recovery time: The recovery time may increase if data needs to be salvaged from damaged infrastructure, with possible relocation and assessment
- Fault tolerance: Backups by themselves do not contribute to fault tolerance, but do allow you to roll back to the last saved version
- Infrastructure: Recovery data storage systems, additional hardware, or a recovery cloud are required
- Price: This is influenced by the frequency of backup, the depth and volume of the data, privacy requirements, and the type of disk or cloud storage
Backup alone, in the sense of emergency copying, is usually enough only for small and medium-sized businesses.
To be useful for disaster recovery, backups need to be correctly implemented. So far, nothing better has been invented than the “3-2-1” rule, which was formulated by photographer Peter Krogh:
1. You need to have three copies of the data
2. Store them on at least two different types of storage media
3. Keep at least one copy outside of the office
The author of the rule was sure that the world is divided into two groups — those who have already lost data, and those who will lose it in the future.
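The rule is simple enough to check automatically. The sketch below assumes each copy is described by a hypothetical `BackupCopy` record; the second condition is modeled as at least two distinct values of a `medium` field (for example local disk, NAS, cloud). The names are illustrative and not part of any particular backup tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str      # e.g. "local disk", "NAS", "cloud"
    offsite: bool    # True if the copy is kept outside the office


def satisfies_3_2_1(copies):
    """Check a list of copies against the three conditions of the rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.medium for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite


# Made-up example: working data plus two backups.
copies = [
    BackupCopy("working data", medium="local disk", offsite=False),
    BackupCopy("nightly backup", medium="NAS", offsite=False),
    BackupCopy("weekly backup", medium="cloud", offsite=True),
]
print(satisfies_3_2_1(copies))
```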
Trust but check
Even if the process of creating and uploading a backup has been debugged as part of the disaster recovery plan, the data recovery itself may not work due to issues such as:
- Corrupted files in the backup
- Inconsistent backup data
- Out-of-date backup data (many changes were made to the system after the upload)
To keep your backup system up to date and operational, it makes sense to include periodic testing as part of your ongoing disaster recovery preparation activities.
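One inexpensive part of such testing is detecting corrupted files. The sketch below assumes a checksum was recorded when the backup was created and compares it with the file as it is now; the function names are hypothetical, and a temporary file stands in for a real backup.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_sha256):
    """Return True if the file still matches the checksum recorded at backup time."""
    return sha256_of(path) == expected_sha256


# Demonstration with a throwaway file standing in for a real backup.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "backup.bin"
    backup.write_bytes(b"payroll data")
    recorded = sha256_of(backup)             # would be stored alongside the backup

    intact = verify_backup(backup, recorded)
    backup.write_bytes(b"payroll dat\x00")   # simulate corruption
    corruption_detected = not verify_backup(backup, recorded)

print(intact, corruption_detected)
```

A checksum only catches corruption. Inconsistent or out-of-date backups are found by actually restoring the data and comparing it with the current system.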
Inside the plan — another plan
The data recovery plan is the core of the entire disaster recovery procedure, since without backed-up data there will simply be nothing to work with.
Items in the sub-plan can correspond to a larger plan: recovery point and time, analysis of risks and weaknesses, storage method (cloud or physical media), etc. It is also better to test the data recovery plan more often because data is the essence of IT and it is key to any disaster response.
Always in touch
A critical component of the disaster recovery checklist is the communications plan.
This is a document that systematizes the exchange of information: its type, time, and place. It also states who is responsible for each chosen form of communication and how its availability is ensured.
If this recovery requires working with big data, contact DATAFOREST, who specializes in this type of activity.
Communications are set up and debugged as in a normal project, with the only difference being that the implementation of the disaster recovery plan has the highest degree of urgency and responsibility. Therefore, the key contributors must be reliable and committed people, either already on the disaster recovery team or called upon as needed. If third parties are involved in the disaster recovery plan, you need to include their representatives in the communication team too.
Clear and understandable
When following the disaster recovery plan, key personnel and other members of the disaster response team should agree on how to communicate. This can be personal communication, a specially created Telegram channel, or a Slack group, as long as the method is accessible and convenient for everyone. The quality of communication in a team must comply with the basic principles of information flow:
1. Accuracy of information communicated
2. Correspondence of the message with the meaning of the process
3. The impact of the received information on the actions of the recipient
When executing the disaster recovery testing checklist, there will not be much time for chatter. Avoid information that obscures the main meaning of the message, and avoid redundancy (oversized messages). Information should be brief, relevant, accurate, and complete.
When to act quickly
To respond quickly during a crisis, the disaster recovery plan needs to stipulate an emergency alert system for:
- The emergency team
- The rest of the company
- Relevant outsourced partners
- Appropriate suppliers
- Appropriate customers
Messages are transmitted in real time by calls, emails, or text messages.
As part of a disaster recovery plan, emergency communication mechanisms should have up-to-date contact information and be tested periodically.
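Keeping contact information up to date is easier when each entry records when it was last verified. The sketch below is one possible approach; the 90-day threshold, the field names, and the sample contacts are assumptions made for illustration.

```python
from datetime import date, timedelta


def stale_contacts(contacts, today, max_age_days=90):
    """Return the names whose details were last verified more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return [contact["name"] for contact in contacts if contact["last_verified"] < cutoff]


# Made-up contact list.
contacts = [
    {"name": "Emergency team lead", "last_verified": date(2024, 5, 20)},
    {"name": "ISP support desk", "last_verified": date(2023, 11, 2)},
]
print(stale_contacts(contacts, today=date(2024, 6, 1)))
```

Anything the check reports can be re-verified during the next periodic test of the alert system.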
Practicing communication methods
Emergency communications during disaster recovery require certain skills, which means they require training. The procedure is as follows:
1. Analysis of needs and number of performers
2. Definition of executing technologies
3. Calculation of the transmission capacity of the system in an emergency
4. Database development and user training
5. Integration into the overall disaster recovery plan
Reconciliation of items with budgetary capacity is also necessary to determine whether third-party assistants are needed or if internal company reserves can be used.
Disaster recovery technical tools
There are two effective approaches to technically implementing disaster recovery: cloud and physical.
With the cloud, everything is clear: the snapshot (the "cast" of the system) is stored either on the company's own remote server or on a partner's. Data cannot be kept only at the site of the accident, so a pre-prepared parallel infrastructure is needed. In this case, we are dealing with the distribution of data over different nodes, that is, replication.
There are two types of replication:
- Passive: data modification occurs in the primary replica, and instructions are sent to a parallel structure, where the same calculations and changes take place. If the primary fails, the passive replica takes over after a short timeout
- Active: modification occurs simultaneously in the primary and reserve structures. If the primary system fails, any of the active replicas takes over the load
Parallel infrastructure in the context of disaster recovery allows one to recover quickly and not think about the long-term rental of partner equipment or paying for a cloud service.
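The timeout-based switch used by a passive replica can be pictured with a toy model. The sketch below is not a real replication system: the `PassiveFailover` class is hypothetical, time is passed in as plain numbers of seconds, and no data is actually copied. It only shows the decision of when the standby takes over.

```python
class PassiveFailover:
    """Toy model: a standby takes over when the primary's heartbeat times out."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = None
        self.active = "primary"

    def heartbeat(self, now):
        """Record that the primary reported in at time `now` (in seconds)."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote the standby if the primary has been silent for too long."""
        silent_too_long = (
            self.last_heartbeat is not None
            and now - self.last_heartbeat > self.timeout_seconds
        )
        if self.active == "primary" and silent_too_long:
            self.active = "standby"
        return self.active


monitor = PassiveFailover(timeout_seconds=5)
monitor.heartbeat(now=0)
print(monitor.check(now=3))   # primary is still within the timeout
print(monitor.check(now=9))   # primary silent for 9 s, standby takes over
```

In an active setup there is no such switch: every replica already carries load, so the remaining ones simply continue.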
How to understand that normal operation has been restored?
When the disaster recovery plan has been fully (and successfully) executed, normal operation will have been restored. It's not necessarily the case that everything will be the same as before the crisis. The plan may have transferred data, relocated employees, or distributed work remotely, as well as implemented changes in the composition and functions of the team. But business operations will now be possible, whereas before, they were not.
Inspection of the scene of the tragedy
No one relishes this task.
Let's say you did a disaster recovery test that was so realistic that it was real. After you exhale and stop looking at your watch, the first thing you need to do is evaluate the consequences of the tragedy. Review all the points of the disaster recovery plan and test them for strength. You need to ask yourselves (honestly):
- What worked well?
- What could have been better?
- Where was the plan materially deficient?
- Did you need to relocate people/equipment?
- What recovery point and recovery time were actually achieved?
- Is it necessary to duplicate the infrastructure?
- How much will this project cost?
The harder and more objectively you analyze the event, the more clearly thin spots will be visible.
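Part of this analysis is plain arithmetic on three timestamps. The sketch below assumes you recorded when the last usable backup was taken, when the failure happened, and when service was restored; the function name and the sample times are made up for illustration.

```python
from datetime import datetime


def achieved_metrics(last_good_backup, failure_time, service_restored):
    """Return (data loss window, downtime) as timedelta objects."""
    data_loss_window = failure_time - last_good_backup
    downtime = service_restored - failure_time
    return data_loss_window, downtime


loss, downtime = achieved_metrics(
    last_good_backup=datetime(2024, 3, 1, 2, 0),
    failure_time=datetime(2024, 3, 1, 9, 30),
    service_restored=datetime(2024, 3, 1, 13, 0),
)
print(f"data lost: {loss}, downtime: {downtime}")
```

Comparing these two numbers with the restore point and recovery time written into the plan shows directly whether the targets were met.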
Some disaster recovery plans include items like "15 minutes to panic and scream." See if this approach would work well in your business.
As a reminder, this document has considered only the high-level components of a disaster recovery protocol. Every company will have its own additional points, depending on its specifics. The main message is that, regardless of the size of the business, such a plan should exist, at least in a primitive form.
If a more advanced technological model involves work in the areas of big data and DevOps, the DATAFOREST team is waiting for your request.
Plan for the next plan
It has become clear that a disaster recovery plan is good by the very fact of its existence. The task now, having undertaken the review, is to create a new, better plan based on the experience gained. The next plan will be much better; just don't forget to write it down. It will certainly include items related to the collection and processing of big data, which is what DATAFOREST does.
Help in difficult times
People do not always act rationally in emergencies. People panic, or they can shut down or "fall into a stupor". Disaster recovery plans are maps made to guide companies out of emergencies. The key first action when disaster strikes is to pull out the plan and open it. Plans are only useful if they are followed.
The more often the better
Any athlete will say, “The more often you train, the better the results will be.”
Most academics will say, “Flashcards are the only learning scheme proven to have an exponential benefit on repetition.”
The more often the system is updated and backed up, the higher the probability of uninterrupted operation of the company. True, you also need to pay for it more often, but you can judge for yourself which is worth more: the medal and the prize money, or the savings from not training at all.
You get what you pay for
Experience has shown, time and again, that it's cheaper to have disaster recovery measures in place than to be in a situation where the server went down and the backup was on the server.
This article describes only the general principles of drawing up a disaster recovery plan; for specific solutions to its big data component, see the DATAFOREST website.
What is a disaster recovery checklist and why is it important?
Disaster recovery is the procedure, method, and behavior of company employees and outsourcers in the event of a critical situation. In cases like this, a disaster recovery plan is necessary because if it is not there, recovery is uncertain and it is likely that expensive and possibly unrecoverable mistakes will be made.
How do I determine which potential disasters to include in my checklist?
It all depends on the size of your budget and the specifics of the application. Any trouble can be included in the disaster recovery checklist — from a natural disaster to the loss of keys to the server room.
How often should I update my disaster recovery checklist?
Exactly as often as there are changes in the infrastructure and software assets that are critical to the company. You can let disaster recovery changes accumulate, but one day the backlog can snowball.
What steps should I take to ensure that my backups are easily recoverable?
Periodic restoration testing in a representative environment, possibly the DR environment, will show that the backup files are not damaged, are consistent, and correspond to the current system configuration, as the disaster recovery checklist requires.
How do I test my disaster recovery plan to make sure it will be effective in the event of a disaster?
It is necessary to simulate the expected disaster and conduct a training session without interrupting the work process or on a day off: shut down the data center, as if there was a fire; disconnect from the main ISP and try a fallback; spend a working day at a parallel infrastructure facility, etc.