Business Continuity and Disaster Recovery Planning
Continuity of Operations Plan (COOP)
Provide procedures and capabilities to sustain an organization's essential, strategic functions at an alternate site for up to 30 days
Business Resumption / Recovery Plan (BRP)
Provide procedures for recovering business operations immediately following a disaster
Continuity of Support Plan
Provide procedures and capabilities for recovering a major application or general support system
Cyber Incident Response Plan
Provide strategies to detect, respond to, and limit the consequences of a malicious cyber incident
Mean Time to Repair (MTTR)
The Mean Time to Repair (MTTR) describes how long it will take to recover a specific failed system.
Categorization - NIST - Computer Security Incident Handling Guide - SP 800-61
DoS - Malicious Code - Unauthorized access - Inappropriate usage - Multiple components
Minimum Operating Requirements (MOR)
Minimum Operating Requirements (MOR) describe the minimum environmental and connectivity requirements needed to operate computer equipment.
Mobile Site
Mobile sites are "data centers on wheels": towable trailers that contain racks of computer equipment, as well as HVAC, fire suppression, and physical security.
Other terms may be substituted for Maximum Tolerable Downtime.
These include Maximum Allowable Downtime (MAD), Maximum Tolerable Outage (MTO), and Maximum Acceptable Outage (MAO).
BCP/DRP frameworks
NIST SP 800-34 Rev. 1, "Contingency Planning Guide for Federal Information Systems." ISO/IEC 27031 focuses on BCP. The separate ISO standard for disaster recovery is "ISO/IEC 24762:2008, Information technology - Security techniques - Guidelines for information and communications technology disaster recovery services."
Cold Site
A cold site is the least expensive recovery solution to implement. It does not include backup copies of data nor does it contain any immediately available hardware.
Develop the contingency planning policy statement
A formal department or agency policy provides the authority and guidance necessary to develop an effective contingency plan.
Hot Site
A hot site is a location that an organization may relocate to following a major disruption or disaster. It is a datacenter with a raised floor, power, utilities, computer peripherals, and fully configured computers.
Call Tree
A key tool leveraged for staff communication by the Crisis Communications Plan is the Call Tree, which is used to quickly communicate news throughout an organization without overburdening any specific person.
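A minimal Python sketch of how a call tree spreads a message in rounds, so that no single person has to make many calls. The roster, the names, and the helper functions `call_rounds` and `heaviest_call_load` are hypothetical and used for illustration only.

```python
def call_rounds(tree, root):
    """Return the list of people reached in each successive round of calls."""
    rounds = []
    current = [root]
    reached = {root}
    while current:
        next_round = []
        for caller in current:
            for callee in tree.get(caller, []):
                if callee not in reached:  # skip anyone already notified
                    reached.add(callee)
                    next_round.append(callee)
        if next_round:
            rounds.append(next_round)
        current = next_round
    return rounds


def heaviest_call_load(tree):
    """Return the largest number of calls any single person has to make."""
    return max((len(callees) for callees in tree.values()), default=0)


# Hypothetical roster: each key calls the people in its list.
CALL_TREE = {
    "Incident Lead": ["Ana", "Ben"],
    "Ana": ["Cy", "Dee"],
    "Ben": ["Eli", "Fay"],
}

for number, people in enumerate(call_rounds(CALL_TREE, "Incident Lead"), start=1):
    print(f"Round {number}: {', '.join(people)}")
print("Most calls made by one person:", heaviest_call_load(CALL_TREE))
```

In this made-up roster, six people are reached in two rounds and nobody places more than two calls.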
Failure and recovery metrics
A number of metrics are used to quantify how frequently systems fail, how long a system may exist in a failed state, and the maximum time to recover from failure. These metrics include the Recovery Point Objective (RPO), Recovery Time Objective (RTO), Work Recovery Time (WRT), Mean Time Between Failures (MTBF), Mean Time to Repair (MTTR), and Minimum Operating Requirements (MOR).
Redundant Site
A redundant site is an exact production duplicate of a system that has the capability to seamlessly operate all necessary IT operations without loss of services to the end user of the system.
Simulation test, also called a walk-through drill
A simulation test, also called a walk-through drill (not to be confused with the discussion-based structured walk-through), goes beyond talking about the process and actually has teams carry out the recovery process.
Activate Team
Activate team: If a disaster is declared, then the recovery team needs to be activated. Depending on the scope of the disaster, this communication could prove extremely difficult. The use of calling trees,
Assess
Assess: Though an initial assessment was carried out during the initial response portion of the disaster recovery process, a more detailed and thorough assessment will be performed by the disaster recovery team.
Assessing the critical state
Assessing the critical state can be difficult because determining which pieces of the IT infrastructure are critical depends solely on how it supports the users within the organization.
Checklist - Consistency Testing
Checklist (also known as consistency) testing lists all necessary components required for successful recovery and ensures that they are, or will be, readily available should a disaster occur.
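The idea can be sketched in a few lines of Python: compare the components the plan requires against what is confirmed available and report the gaps. The component names and the helper `missing_components` are illustrative assumptions, not part of any standard.

```python
def missing_components(required, available):
    """Return required recovery components that are not confirmed available."""
    return sorted(set(required) - set(available))


# Hypothetical checklist entries for illustration only.
REQUIRED = ["backup tapes", "generator fuel", "spare servers", "vendor contact list"]
AVAILABLE = ["backup tapes", "vendor contact list", "spare servers"]

gaps = missing_components(REQUIRED, AVAILABLE)
print(gaps)  # ['generator fuel']
```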
Communicate
Communicate: One of the most difficult aspects of disaster recovery is ensuring that consistent, timely status updates are communicated back to the central team managing the response and recovery process.
Conduct the business impact analysis (BIA)
Conduct the business impact analysis (BIA): The BIA helps to identify and prioritize critical IT systems and components. A template for developing the BIA is also provided to assist the user.
Develop an IT contingency plan
Develop an IT contingency plan: The contingency plan should contain detailed guidance and procedures for restoring a damaged system.
Develop recovery strategies
Develop recovery strategies: Thorough recovery strategies ensure that the system may be recovered quickly and effectively following a disruption.
Environmental
Environmental—Threats focused on information systems or datacenter environments include items such as power issues (blackout, brownout, surge, spike), system component or other equipment failures, and application or software flaws.
Types of disruptive events
Errors and omissions - Natural disasters - Electrical or power problems - Temperature and humidity - Warfare, terrorism, and sabotage - Financially motivated attackers - Personnel shortages
Identify preventive controls
Identify preventive controls: Measures taken to reduce the effects of system disruptions can increase system availability and reduce contingency life cycle costs.
Natural
Natural: The most obvious type of threat that can result in a disaster is naturally occurring. This category includes threats such as earthquakes, hurricanes, tornadoes, floods, and some types of fires. Historically, natural disasters have been among the most devastating events that an organization has had to respond to.
Partial and complete business interruption
Partial and complete business interruption: Arguably, the highest-fidelity DRP test is business interruption testing. However, this type of test can actually be the cause of a disaster, so extreme caution should be exercised before attempting an actual interruption test.
Plan maintenance
Plan maintenance: The plan should be a living document that is updated regularly to remain current with system enhancements.
Plan testing, training, and exercises
Plan testing, training, and exercises: Testing the plan identifies planning gaps, whereas training prepares recovery personnel for plan activation; both activities improve plan effectiveness and overall agency preparedness.
Incident Response Plan - IRP - 6 Phases
Preparation - Identification - Containment - Eradication - Recovery - Lessons Learned
Project Initiation
Project Initiation involves seven distinct milestones
Occupant Emergency Plan (OEP)
Provide coordinated procedures for minimizing loss of life or injury and protecting property from damage in response to a physical attack
Disaster Recovery Plan (DRP)
Provide detailed procedures to facilitate recovery of capabilities at an alternate site
Crisis Management Plan (CMP)
Provide procedures for disseminating status reports to personnel and the public
Business Continuity Plan (BCP)
Provide procedures for sustaining essential business operations while recovering from a significant disruption
Reconstitution
Reconstitution: The primary goal of the reconstitution phase is to successfully recover critical business operations at either the primary or a secondary site.
Respond
Respond: In order to begin the disaster recovery process, there must be an initial response that begins the process of assessing the damage. Speed is essential during this initial assessment.
The Disaster Recovery Process
Respond, Activate Team, Communicate, Assess, Reconstitute
Starting emergency power
Starting emergency power: Though it might seem simple, converting a datacenter to emergency power, such as backup generators that will begin taking the load as the UPS units fail, is not to be taken lightly.
BCP/DRP-focused risk assessment
The BCP/DRP-focused risk assessment determines what risks are inherent to which IT assets. A vulnerability analysis is also conducted for each IT system and major application.
Relationship between BCP and DRP
The Business Continuity Plan is an umbrella plan that includes multiple specific plans, most importantly the Disaster Recovery Plan. The Disaster Recovery Plan serves as a subset of the overall Business Continuity Plan.
DRP Review
The DRP review is the most basic form of initial DRP testing and is focused on simply reading the DRP in its entirety to ensure completeness of coverage.
Disaster Recovery Planning
The Disaster Recovery Plan (DRP) provides a short-term plan for dealing with specific IT-oriented disruptions.
Recovery Point Objective RPO
The Recovery Point Objective (RPO) is the amount of data loss or system inaccessibility (measured in time) that an organization can withstand. It marks the point prior to the outage to which data are to be restored, that is, the last point of known good data.
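To make the time-based nature of the RPO concrete, here is a small Python sketch that measures the data-loss window (time between the last good backup and the failure) and compares it with an RPO. The timestamps, the RPO values, and the helper names `data_loss_window` and `meets_rpo` are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Return the time between the last backup taken before the failure and the failure."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup exists before the failure time")
    return failure_time - max(earlier)


def meets_rpo(backup_times, failure_time, rpo):
    """Check whether the data lost at failure_time stays within the RPO."""
    return data_loss_window(backup_times, failure_time) <= rpo


# Hypothetical schedule: backups every six hours, failure at 15:30.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]
failure = datetime(2024, 1, 1, 15, 30)

print(data_loss_window(backups, failure))                # 3:30:00
print(meets_rpo(backups, failure, timedelta(hours=4)))   # True
print(meets_rpo(backups, failure, timedelta(hours=1)))   # False
```

With periodic backups, the worst-case loss approaches the full interval between backups, so an RPO shorter than that interval cannot be guaranteed by the schedule alone.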
Recovery Time Objective RTO
The Recovery Time Objective (RTO) describes the maximum time allowed to recover business or IT systems. RTO is also called the systems recovery time.
Service Delivery Objectives - SDO
The SDO is the acceptable level of service to be achieved within the Recovery Time Objective (RTO).
Identify critical assets
The critical asset list is a list of those IT assets that are deemed business essential by the organization.
Determine Maximum Tolerable Downtime
The primary goal of the BIA is to determine the Maximum Tolerable Downtime (MTD), which describes the total time a system can be inoperable before an organization is severely impacted.
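Once an MTD has been determined for each system, one simple way to turn the results into a recovery order is to sort by MTD, shortest first. The Python sketch below assumes that rule; the `Asset` class, the system names, and the hour values are hypothetical, and a real BIA would typically weigh more factors than MTD alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    mtd_hours: float  # Maximum Tolerable Downtime from the BIA, in hours


def recovery_priority(assets):
    """Order assets so that the one with the shortest MTD is recovered first."""
    return sorted(assets, key=lambda asset: asset.mtd_hours)


# Hypothetical BIA results for illustration only.
bia_results = [
    Asset("payroll", 72),
    Asset("order processing", 4),
    Asset("email", 24),
]

for rank, asset in enumerate(recovery_priority(bia_results), start=1):
    print(f"{rank}. {asset.name} (MTD {asset.mtd_hours} h)")
```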
NIST Special Publication 800-34
Provides a visual means for understanding the interrelatedness of a BCP and a DRP, as well as the Continuity of Operations Plan (COOP), Occupant Emergency Plan (OEP), and others.
Maximum Tolerable Downtime is composed of two metrics:
The Recovery Time Objective (RTO) and the Work Recovery Time (WRT)
Warm Site
A warm site has some aspects of a hot site, for example, readily accessible hardware and connectivity, but it will have to rely upon backup data in order to reconstitute a system after a disruption. It is a datacenter with a raised floor, power, utilities, computer peripherals, and fully configured computers.
AIW
Acceptable Interruption Window
Business Impact Analysis - BIA
Business Impact Analysis (BIA) is the formal method for determining how a disruption to the IT system(s) of an organization will impact the organization's requirements, processes, and interdependencies with respect to the business mission.
Change Management
Change management includes tracking and documenting all planned changes, formal approval for substantial changes, and documentation of the results of the completed change. All changes must be auditable.
Human
Human—The human category of threats represents the most common source of disasters. Human threats can be further classified by whether they constitute an intentional or unintentional threat.
IPF
Information Processing Facility
System Development Lifecycle - SDLC
Initiation - Development/Acquisition - Implementation - Operations/Maintenance - End of Life/Disposition
Maximum Tolerable Outage - MTO
MTO is the total time that operations can be sustained at an alternate site.
Mean Time Between Failures
Mean Time Between Failures (MTBF) quantifies how long a new or repaired system will run before failing.
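MTBF is often combined with Mean Time to Repair (MTTR) to estimate steady-state availability using the standard relation availability = MTBF / (MTBF + MTTR). This relation is not stated in these notes, so treat the Python sketch below as a supplementary illustration; the function names and hour values are assumptions.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability as MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_annual_downtime_hours(mtbf_hours: float, mttr_hours: float,
                                   hours_per_year: float = 8760.0) -> float:
    """Estimate yearly downtime implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * hours_per_year


# Hypothetical system: fails about every 990 hours and takes 10 hours to repair.
print(availability(990, 10))                                 # 0.99
print(round(expected_annual_downtime_hours(990, 10), 1))     # 87.6
```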
Defining Incident Management Processes - CMU/SEI
Prepare - Protect - Detect - Triage - Respond
Business Continuity Planning
The overarching goal of a BCP is to ensure that the business will continue to operate before, throughout, and after a disaster event is experienced.
Downtime consists of two elements
The systems recovery time and the work recovery time. Therefore, MTD = RTO + WRT.
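As a worked illustration of MTD = RTO + WRT, the following Python sketch adds the two components and checks a recovery plan against an MTD limit. The function names and the hour values are illustrative assumptions, not figures from any standard.

```python
def maximum_tolerable_downtime(rto_hours: float, wrt_hours: float) -> float:
    """Return MTD as systems recovery time (RTO) plus work recovery time (WRT)."""
    if rto_hours < 0 or wrt_hours < 0:
        raise ValueError("RTO and WRT must be non-negative")
    return rto_hours + wrt_hours


def fits_within_mtd(rto_hours: float, wrt_hours: float, mtd_limit_hours: float) -> bool:
    """Check whether planned RTO + WRT stays within the MTD limit set by the BIA."""
    return maximum_tolerable_downtime(rto_hours, wrt_hours) <= mtd_limit_hours


# Hypothetical figures: 24 hours to restore systems, 12 hours to configure them.
print(maximum_tolerable_downtime(24, 12))   # 36
print(fits_within_mtd(24, 12, 48))          # True
print(fits_within_mtd(40, 12, 48))          # False
```

The third call fails the check because 40 + 12 = 52 hours exceeds the 48-hour limit, which shows why the RTO alone cannot be compared against the MTD.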
Disasters or disruptive events
The three common ways of categorizing the causes for disasters are whether the threat agent is natural, human, or environmental in nature.
Work Recovery Time (WRT)
Work Recovery Time (WRT) describes the time required to configure a recovered system.
CSIRTs
Computer Security Incident Response Teams - CSIRTs
Parallel Processing
Parallel processing: This type of test is common in environments where transactional data is a key component of the critical business processing. Typically, this test will involve recovery of critical processing components at an alternate computing facility.
NIST SP 800-34, Contingency Planning Guide: steps to achieving a sound, logical BCP/DRP
Project Initiation - Scope the Project - Business Impact Analysis - Identify Preventive Controls - Recovery Strategy - Plan Design and Development - Implementation, Training, and Testing - BCP/DRP Maintenance