Disaster Recovery Plans
To test your systems you should simulate the conditions your monitoring systems are designed to catch by...
Make sure the detection thresholds actually fire the alerts like they're supposed to. You'll also want to test your reactions and responses to these alerts.
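A minimal sketch of such a drill, assuming a single numeric threshold per alert. The function names, the 30-degree limit, and the sample readings are hypothetical and only illustrate feeding simulated conditions through a threshold check to confirm the alert fires.

```python
def should_alert(reading: float, threshold: float) -> bool:
    """Return True when a reading meets or exceeds the alert threshold."""
    return reading >= threshold


def run_drill(threshold: float, simulated_readings: list) -> list:
    """Feed simulated readings through the check; return the ones that fired."""
    return [r for r in simulated_readings if should_alert(r, threshold)]


# Hypothetical drill: a server-room temperature alert set at 30 degrees C.
fired = run_drill(30.0, [22.5, 29.9, 30.0, 41.0])
print(fired)
```

A real drill would also confirm that the people on call received the alert and followed the documented response.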
What do you need to consider when designing a disaster recovery plan?
Perform a risk assessment. Determine backup and recovery systems. Determine detection and alert measures, and test systems.
What critical operations should be redundant?
This includes power delivery or supply, communications systems, data links, and hardware.
What types of things are preventative measures?
This includes things like regular backups and redundant systems. It's standard for critical network infrastructure or services to have redundant power supplies.
What questions should you ask when thinking through the impacts of a disaster on your day to day operations?
What would happen to your network if the building lost power? Can you continue to work if the fiber-optic data line to the building gets damaged by nearby construction work? What if the core router for the office just burst into flames?
Single point of failure
When one system in a redundant pair suffers a failure, the surviving system becomes a single point of failure until the pair is restored.
Usually, a disaster will take out one system that's part of a redundant pair or a replication scheme, which would prevent
a complete service outage.
Ideally, you should have regular, automated backups to backup systems located
both on site and off site.
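One way to keep an eye on both locations is to check how old the newest backup at each one is. This is a sketch only: the location names, timestamps, and the 24-hour limit are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def stale_backups(last_backup: dict, now: datetime, max_age: timedelta) -> list:
    """Return sorted names of locations whose latest backup is older than max_age."""
    return sorted(name for name, ts in last_backup.items() if now - ts > max_age)


# Hypothetical timestamps for the most recent backup at each location.
now = datetime(2024, 1, 10, 12, 0)
last_backup = {
    "on_site": datetime(2024, 1, 10, 2, 0),   # 10 hours old
    "off_site": datetime(2024, 1, 7, 2, 0),   # more than 3 days old
}
print(stale_backups(last_backup, now, timedelta(hours=24)))
```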
A risk assessment allows you to prioritize
certain aspects of the organization that are more at risk if there's an unforeseen event.
When looking at detection measures, you'll want to make sure you have a
comprehensive system in place that can quickly detect and alert you to service outages or abnormal environmental conditions.
Preventative measures
cover any procedures or systems in place that will proactively minimize the impact of a disaster. Anything that's done before an actual disaster that's able to reduce the overall downtime of the event is considered preventative.
The way to think about designing detection measures is to evaluate what's most
critical to the day-to-day functioning of the organization.
The disaster recovery doc doesn't need to contain the
details of the operations. Links and references are sufficient.
Redundant power supplies can include
different power sources, like battery backup.
Make sure that every important operational procedure is
documented and accessible.
It's also critical that you have data recovery procedures clearly
documented and kept up-to-date.
Other things that should be monitored to help head off any unexpected disasters include
environmental conditions inside server and networking rooms. Flood sensors can alert you to water coming into the server room. Temperature and humidity sensors can watch for dangerously high temperatures or suboptimal moisture levels in the air, which can alert you to failed cooling in the server room. Smoke detectors and fire alarms are also critically important, as are evacuation procedures.
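The sensor checks described above can be sketched as a single function that turns one reading into a list of alerts. The field names and the acceptable ranges are hypothetical; real limits depend on your equipment and sensors.

```python
# Hypothetical acceptable ranges; real limits depend on your equipment.
LIMITS = {
    "temperature_c": (10.0, 30.0),
    "humidity_pct": (20.0, 80.0),
}


def environment_alerts(reading: dict) -> list:
    """Return a list of alert messages for one set of sensor readings."""
    alerts = []
    for key, (low, high) in LIMITS.items():
        value = reading[key]
        if value < low or value > high:
            alerts.append(f"{key} out of range: {value}")
    # Flood and smoke sensors are treated as simple on/off signals here.
    if reading.get("water_detected"):
        alerts.append("water detected")
    if reading.get("smoke_detected"):
        alerts.append("smoke detected")
    return alerts


print(environment_alerts(
    {"temperature_c": 35.0, "humidity_pct": 50.0, "water_detected": True}
))
```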
Make sure you have a sound backup and recovery system, along with a
good strategy in place.
If there's a power outage, for example, critical systems should fall back to battery power. But battery backup power will only keep the systems online for so long. To avoid potential data loss or damage, these systems should be
gracefully shut down before they completely lose power.
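A rough sketch of the shutdown decision, assuming the battery backup can report an estimated runtime in minutes. The function name, its parameters, and the 5-minute safety margin are illustrative assumptions, not part of any particular UPS product.

```python
def should_shut_down(on_battery: bool, runtime_left_min: float,
                     shutdown_time_min: float,
                     safety_margin_min: float = 5.0) -> bool:
    """Decide whether to begin a graceful shutdown now.

    Shut down once the remaining battery runtime is no longer comfortably
    larger than the time a clean shutdown takes.
    """
    if not on_battery:
        return False
    return runtime_left_min <= shutdown_time_min + safety_margin_min


# Hypothetical case: 12 minutes of battery left, shutdown takes 10 minutes.
print(should_shut_down(True, 12, 10))
```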
Risk assessment can involve brainstorming hypothetical scenarios and analyzing these events to understand
how they'd impact your organization and operations.
A disaster recovery plan
is a collection of documented procedures and plans on how to react to and handle an emergency or disaster scenario, from the operational perspective. This includes things that should be done before, during, and after a disaster.
The goal of the disaster recovery plan is to
minimize disruption to business and IT operations, by keeping downtime of systems to a minimum and preventing significant data loss.
A redundant power supply is designed to minimize the downtime that would be caused by
one power supply failing or a power outage.
Another super important preventive measure that should be evaluated and verified is
operational documentation
The specifics of building evacuation usually fall to the building management team, but as an IT support specialist, you'll likely work closely with some members of this team on things like
power delivery, heating and cooling systems and building evacuation.
A disaster recovery plan will actually cover
preventive measures and detection measures on top of the post disaster recovery approach.
If it's something critical to permanent operations, it should probably have a
redundant spare, just in case.
Anything critical to operations should be made
redundant whenever possible.
Your disaster plan should include
references or links to documentation for these types of tasks: anything and everything that would be required to restore normal operations following a disaster. This is where the steps for restoring various systems and data from backups should be.
Lots of systems that support redundant power supplies also have a function to
send alerts on power loss events.
What types of operational procedures should be documented?
Setting up and configuring critical systems and infrastructure. Any steps or specific configuration details that are needed to restore 100 percent functionality to core systems and services should be documented in detail.
You want to conduct regular disaster tests to make sure
systems are functioning and that your procedures to handle them are also up to the task.
When you look into preventive measures, pay attention to
systems that lack redundancy.
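One simple way to spot systems that lack redundancy is to count how many instances of each critical component exist. This is a sketch under the assumption that you keep such an inventory; the component names and counts are made up.

```python
def single_points_of_failure(inventory: dict) -> list:
    """Return sorted names of components with fewer than two instances."""
    return sorted(name for name, count in inventory.items() if count < 2)


# Hypothetical inventory: component name -> number of working instances.
inventory = {
    "core_router": 1,
    "power_supply": 2,
    "internet_link": 2,
    "file_server": 1,
}
print(single_points_of_failure(inventory))
```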
To perform a risk assessment you need to
take a long hard look at the operations and characteristics of your teams.
If there's a fire and the building needs to be evacuated, you should be prepared to set up
temporary accommodation so people can still work effectively.
You also want to monitor for conditions that indicate
that a problem is likely to occur.
An effective way to check the documentation is to periodically verify that the
steps documented actually work.
With disaster recovery the mechanisms chosen and procedures put in place will depend a lot on
the specifics of your organization and environment.
Timely notification of a disaster is critical, since some steps of the disaster recovery plan might be
time sensitive to ensure there is no data loss or equipment damage.
If uptime and availability are important for your organization, you'll likely have
two internet connections: a primary and a secondary. You want to monitor the connection status of both of these links. Ideally, they should be configured to automatically fail over if one goes down.
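The failover logic can be sketched as a small function that picks the active link and reports which links are down. How link status is actually measured is outside this sketch; the function name and the "primary"/"secondary" labels are assumptions.

```python
def evaluate_links(primary_up: bool, secondary_up: bool) -> tuple:
    """Return (active_link, alerts) for a primary/secondary connection pair.

    active_link is "primary", "secondary", or None when both are down.
    alerts lists every link that is currently down, so a failed secondary
    is reported even while the primary is still carrying traffic.
    """
    status = (("primary", primary_up), ("secondary", secondary_up))
    alerts = [name for name, up in status if not up]
    if primary_up:
        active = "primary"
    elif secondary_up:
        active = "secondary"
    else:
        active = None
    return active, alerts


print(evaluate_links(False, True))
```

Monitoring both links matters because a silent failure of the secondary leaves the primary as a single point of failure.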
It's important to be prepared for a situation where typical documentation access methods are
unavailable
It's also important that this documentation is kept
up to date
Detection measures
are meant to alert you and your team that a disaster has occurred that can impact operations.
Corrective or recovery measures
are those enacted after a disaster has occurred. These measures involve steps like restoring lost data from backups or rebuilding and reconfiguring systems that were damaged.
What types of critical systems should you monitor?
You want to monitor the condition of service and infrastructure equipment: things like temperatures, CPU load, and network load. For a service, monitoring error rates and requests per second will give you insight into the performance of the system. You should investigate any unusual spikes or unexpected increases.
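A very simple way to flag an unusual spike is to compare the latest value against the average of recent values. This is a sketch, not a recommended detection method: the factor of 2 and the sample request rates are assumptions, and real monitoring tools offer more robust approaches.

```python
from statistics import mean


def is_spike(history: list, latest: float, factor: float = 2.0) -> bool:
    """Flag the latest value when it exceeds factor times the mean of history."""
    if not history:
        return False  # nothing to compare against yet
    return latest > factor * mean(history)


# Hypothetical requests-per-second samples followed by a sudden jump.
print(is_spike([100, 110, 90], 250))
```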
Think through the impacts that a disaster affecting each of these aspects would have on
your day-to-day operations.