ISMV4 Module 10, 11, & 12
Local Replication: VM Snapshot
A VM snapshot preserves the state and data of a VM at a specific PIT. The state includes the VM's power state (for example, powered-on, powered-off, or suspended). The data includes all the files that make up the VM, including disks, memory, and other devices such as virtual network interface cards. A VM snapshot is useful for quick restore of a VM.
Causes of Information Unavailability
Application failure (for example, due to catastrophic exceptions caused by bad logic); data loss; infrastructure component failure (for example, due to power failure or disaster); data center or site down (for example, due to power failure or disaster); refreshing IT infrastructure.
Graceful Degradation
The application maintains limited functionality even when some of its modules or supporting services are not available. Unavailability of certain application components or modules should not bring down the entire application.
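A minimal sketch of graceful degradation, assuming a hypothetical product page that depends on an optional recommendation service. The names `build_product_page`, `healthy_service`, and `unavailable_service` are illustrative and are not from the course material.

```python
def build_product_page(product_id, fetch_recommendations):
    """Build a page; fall back to limited functionality if a module is down."""
    page = {"product_id": product_id, "recommendations": [], "degraded": False}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except ConnectionError:
        # The optional module is unavailable: serve the page without it
        # instead of failing the whole request.
        page["degraded"] = True
    return page


def healthy_service(product_id):
    """Stand-in for a working recommendation service (hypothetical)."""
    return ["item-2", "item-3"]


def unavailable_service(product_id):
    """Stand-in for a recommendation service that is currently down."""
    raise ConnectionError("recommendation service is down")


full_page = build_product_page("item-1", healthy_service)
limited_page = build_product_page("item-1", unavailable_service)
```

Here `limited_page` still carries the core product data; only the recommendations are missing.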
Persistent Application State Model
Application state information is stored out of memory, in a data repository. If an instance fails, the state information is still available in the repository.
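A small sketch of the idea, assuming a plain dictionary stands in for the external data repository and a hypothetical `CartService` class stands in for an application instance. A real deployment would likely use a database or a distributed cache instead.

```python
class CartService:
    """Application instance that keeps no session state in its own memory."""

    def __init__(self, repository):
        self.repository = repository  # shared store that outlives any instance

    def add_item(self, session_id, item):
        items = self.repository.get(session_id, [])
        self.repository[session_id] = items + [item]

    def items(self, session_id):
        return list(self.repository.get(session_id, []))


repository = {}                      # stand-in for the data repository
first = CartService(repository)
first.add_item("s1", "disk")
del first                            # simulate the instance failing

replacement = CartService(repository)
replacement.add_item("s1", "tape")   # state from the failed instance is still there
```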
Resilient Application Overview
Applications have to be designed to deal with IT resource failures to guarantee the required availability. Fault-resilient applications have logic to detect and handle transient fault conditions to avoid application downtime. Examples of key application design strategies for improving availability: graceful degradation of application functionality, retry logic in application code, and a persistent application state model.
Dynamic Disk Sparing
Automatically replaces a failed drive with a spare drive to protect against data loss. Multiple spare drives can be configured to improve availability.
Business Continuity Traits
The BC process enables continuous availability of information and services in the event of failure, to meet the required SLA. BC involves various proactive and reactive countermeasures. It is important to automate the BC process to reduce manual intervention. The goal of a BC solution is to ensure information availability.
Key backup components are:
Backup client, backup server, storage node, and backup device (backup target).
Local Replication: VM Snapshot Example
Child virtual disks store all the changes that are made to the parent VM after snapshots are created. When committing snapshot 3, the data on child virtual disk files 1 and 2 is committed before the data on child virtual disk 3 is committed to the parent virtual disk file. After the data is committed, child virtual disks 1, 2, and 3 are deleted. However, when rolling back to snapshot 1, child disk file 1 is retained and snapshots 2 and 3 are discarded.
VM clone
A clone is a copy of an existing virtual machine (parent VM). The clone VM's MAC address is different from the parent VM's. Clones are typically deployed when many identical VMs are required, which reduces the time required to deploy a new VM.
Local Replication: Clone
Cloning provides the ability to create fully populated point-in-time copies of LUNs within a storage system, or to create a copy of an existing VM. For a clone of a storage volume, initial synchronization is performed between the source LUN and the replica (clone), and changes made to both the source and the replica can be tracked at some predefined granularity.
BC - Analyze
Collect information on data profiles, business processes, infrastructure support, dependencies, and frequency of using business infrastructure. Conduct a business impact analysis. Identify critical business processes and assign recovery priorities. Perform risk analysis for critical functions and create mitigation strategies. Perform cost-benefit analysis for available solutions based on the mitigation strategy. Evaluate options.
Link Aggregation
Combines links between two switches, and also between a switch and a node. Enables network traffic failover in the event of a link failure in the aggregation.
Journal Volume
Contains all the data that has changed on the production volume from the time the replication session started.
Remote Replication: Multisite
Data from the source site is replicated to multiple remote sites for DR purposes. Disaster recovery protection is always available if any one-site failure occurs. This mitigates the risk in two-site replication, where no DR protection remains after a source or remote site failure.
BC - Design & Develop
Define the team structure and assign individual roles and responsibilities; for example, different teams are formed for activities such as emergency response and infrastructure and application recovery. Design data protection strategies and develop infrastructure. Develop contingency solutions and emergency response procedures. Detail recovery and restart procedures.
BC - Establish Objectives
Determine BC requirements. Estimate the scope and budget to achieve the requirements. Select a BC team that includes subject matter experts from all areas of the business, whether internal or external. Create BC policies.
Virtual Tape Library
Disks are emulated and presented as tapes to backup software. Does not require any additional modules or changes in the legacy backup software. Provides better performance and reliability than physical tape. Does not require the usual maintenance tasks that are associated with a physical tape drive, such as periodic cleaning and drive calibration.
Multipathing
Enables a compute system to use multiple paths for transferring data to a LUN. Enables failover by redirecting I/O from a failed path to another active path. Performs load balancing by distributing I/O across active paths.
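The failover and load-balancing behavior can be sketched as follows. This is an illustrative model only: the `MultipathDevice` class, the path names, and the round-robin policy are assumptions, not the behavior of any particular multipathing product.

```python
from itertools import cycle


class MultipathDevice:
    """Toy model of a LUN reachable over several paths."""

    def __init__(self, paths):
        self.alive = {path: True for path in paths}
        self._round_robin = cycle(paths)   # simple load-balancing policy
        self._path_count = len(paths)

    def fail_path(self, path):
        self.alive[path] = False

    def pick_path(self):
        """Return the next active path, skipping any failed ones (failover)."""
        for _ in range(self._path_count):
            path = next(self._round_robin)
            if self.alive[path]:
                return path
        raise IOError("no active path to LUN")


device = MultipathDevice(["hba0", "hba1"])
balanced = [device.pick_path() for _ in range(4)]   # I/O alternates across paths
device.fail_path("hba0")
after_failure = [device.pick_path() for _ in range(2)]  # I/O redirected to hba1
```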
Elastic Load Balancing
Enables dynamic distribution of application and client I/O traffic. Dynamically scales resources (VM instances) to meet traffic demands. Provides fault tolerance by detecting unhealthy VM instances and automatically redirecting I/O to healthy VM instances.
Disk Library
Enhanced backup and recovery performance. No inherent offsite capability. A disk-based backup appliance includes features such as deduplication, compression, encryption, and replication to support business objectives.
Dell EMC PowerPath
Host-based multipathing software. Provides path failover and load-balancing functionality. Automatic detection of and recovery from host-to-array path failures. PowerPath/VE software enables optimizing virtual environments with PowerPath multipathing features.
BC - Implement
Implement risk management and mitigation procedures that include backup, replication, and management of resources. Prepare the DR sites that can be utilized if a disaster affects the primary data center. The DR site could be one of the organization's own data centers or a cloud. Implement redundancy for every resource in a data center to avoid single points of failure.
CDP Appliance
Intelligent hardware platform that runs the CDP software. Manages both local and remote replication. The appliance could also be virtual, where the CDP software runs inside VMs.
Write Splitter
Intercepts writes to the production volume from the compute system and splits each write into two copies. Can be implemented at the compute, fabric, or storage system.
Hypervisor-based CDP
Protects a single VM or multiple VMs, locally or remotely. Enables restoring a VM to any PIT. The virtual appliance runs on a hypervisor. The write splitter is embedded in the hypervisor.
VMware FT
Provides continuous availability for applications in the event of server failure. Creates a live shadow instance of a VM that is in virtual lockstep with the primary instance. FT eliminates even the smallest chance of data loss or disruption.
Fault Detection and Retry Logic
Refers to a mechanism that implements logic in the application code to improve availability. The logic detects a service that is temporarily down and retries it, which may successfully restore the service.
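A minimal sketch of retry logic, assuming a hypothetical `TransientError` exception marks a temporary fault and a stand-in service that recovers after a fixed number of failed calls. The attempt count and delay values are illustrative.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for a service that is temporarily down."""


def call_with_retry(operation, attempts=3, delay_s=0.01):
    """Call operation(); on a transient fault, wait and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # the fault persisted: give up and surface the error
            time.sleep(delay_s * attempt)  # wait a little longer before each retry


def make_flaky_service(failures_before_success):
    """Build a stand-in service that fails a fixed number of times, then recovers."""
    state = {"calls": 0}

    def service():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("service temporarily down")
        return "ok"

    return service


result = call_with_retry(make_flaky_service(failures_before_success=2))
```

With two transient failures and three attempts, the third call succeeds. A service that keeps failing past the attempt limit still raises the error to the caller.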
Tape Library
Tapes are portable and can be used for long-term offsite storage. Must be stored in locations with a controlled environment. Not optimized to recognize duplicate content. Data integrity and recoverability are major issues with tape-based backup media.
Train, Test, Assess, and Maintain
Train the employees who are responsible for backup and replication of business-critical data on a regular basis or whenever there is a modification in the BC plan. Train employees on emergency response procedures when disasters are declared. Train the recovery team on recovery procedures based on contingency scenarios. Perform damage-assessment processes and review recovery plans. Test the BC plan regularly to evaluate its performance and identify its limitations. Assess the performance reports and identify limitations. Update the BC plans and recovery/restart procedures to reflect regular changes within the data center.
Compute Clustering
Two or more compute systems/hypervisors are clustered to provide high availability and load balancing. A service running on a failed compute system moves to another compute system. Two common clustering implementations are active/active and active/passive.
Storage Virtualization
A virtual volume is created using a virtualization appliance. Each I/O to the volume is mirrored to the LUNs on the storage systems. The virtual volume is continuously available to the compute system, even if one of the storage systems is unavailable due to failure.
Remote Replication: Synchronous
A write is committed to both the source and the remote replica before it is acknowledged to the compute system. Enables restarting business operations at a remote site with zero data loss. Provides near-zero RPO.
An availability zone is
A location with its own set of resources that is isolated from other zones. A zone can be an entire data center or a part of a data center. Enables running multiple service instances within and across zones to survive data center or site failure. If there is an outage, the service should seamlessly fail over across the zones.
Definition: Disaster Recovery (DR)
A part of BC process, which involves a set of policies and procedures for restoring IT infrastructure, including data that is required to support ongoing IT services, after a natural or human-induced disaster occurs.
Definition: Data Replication
A process of creating an exact copy (replica) of the data to ensure business continuity in the event of a local outage or disaster. Replicas are used to restore and restart operations if data loss occurs. Data can be replicated to one or more locations based on the business requirements.
Network Fault Tolerance Mechanisms
Even a short network interruption could impact many services running in a data center environment. So, the network infrastructure must be fully redundant and highly available, with no single points of failure.
Definition: Recovery-in-place
A term that refers to running a VM directly from the backup device, using a backed-up copy of the VM image instead of restoring that image file. Eliminates the need to transfer the image from the backup device to the primary storage before it is restarted. Provides an almost instant recovery of a failed VM. Requires a random access device (a disk-based backup target) to work efficiently. Reduces the RTO and the network bandwidth needed to restore VM files.
Remote Replication: Asynchronous
A write is committed to the source and immediately acknowledged to the compute system. Data is buffered at the source and sent to the remote site periodically. Application write response time is not dependent on the latency of the link. The replica is behind the source by a finite amount (finite RPO).
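The write path can be sketched as follows, assuming in-memory dictionaries stand in for the source and remote volumes and a queue stands in for the source-side buffer. The class and method names are hypothetical.

```python
from collections import deque


class AsyncReplicator:
    """Toy model of asynchronous remote replication."""

    def __init__(self):
        self.source = {}
        self.remote = {}
        self.buffer = deque()   # writes waiting to be sent to the remote site

    def write(self, block, data):
        self.source[block] = data
        self.buffer.append((block, data))
        return "ack"            # acknowledged without waiting for the remote site

    def transfer(self):
        """Periodic job: drain buffered writes to the remote replica in order."""
        while self.buffer:
            block, data = self.buffer.popleft()
            self.remote[block] = data

    def pending_writes(self):
        return len(self.buffer)  # how far the replica is behind the source


replicator = AsyncReplicator()
ack = replicator.write("blk-0", "v1")
behind_before = replicator.pending_writes()   # replica lags by one write
replicator.transfer()
behind_after = replicator.pending_writes()    # replica has caught up
```

The writes still sitting in the buffer when a disaster strikes are the data at risk, which is why the RPO is finite rather than zero.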
Definition: Fault Tolerance
Ability of an IT system to continue functioning in the event of a failure.
Optimize Load Balancing:
Adjust I/O paths to dynamically rebalance your application environment for peak performance
Impact of Information Unavailability
An IT service outage, due to information unavailability, results in loss of productivity, loss of revenue, poor financial performance, and damage to reputation.
Definition: Backup
An additional copy of production data, which is created and retained for the sole purpose of recovering lost or corrupted data.
Definition: NDMP
An open standard TCP/IP-based protocol that is designed for backup in a NAS environment. Data can be backed up using NDMP regardless of the operating system or platform. Backup data is sent directly from the NAS to the backup device, so it is no longer necessary to transport data through application servers. Backs up and restores data while preserving the security attributes of the file system (NFS and CIFS) and maintaining data integrity.
Data Migration
Another use for a replica is data migration. Data migrations are performed for various reasons such as migrating from a smaller capacity LUN to one of a larger capacity.
Recovery Operation
BBB (only need to know the first 3 B's) DSB
(1) Backup client requests backup server for data restore.
(2) Backup server scans backup catalog to identify data to be restored and the client that will receive data.
(3) Backup server instructs storage node to load backup media in the backup device.
(4) Data is then read and sent to the backup client.
(5) Storage node sends restore metadata to the backup server.
(6) Backup server updates the backup catalog.
BC vs Disaster Recovery
BC comes before a disruption (ensuring uptime); disaster recovery covers the steps taken afterward to recover.
BC Planning Lifecycle
BC planning must follow a disciplined approach like any other planning process. Organizations today dedicate specialized resources to develop and maintain BC plans. From the conceptualization to the realization of the BC plan, a lifecycle of activities can be defined for the BC process. The BC planning lifecycle includes five stages: establish objectives; analyze; design and develop; implement; and train, test, assess, and maintain.
Replicas are created for various purposes which include the following:
Can act as a source for backup. Can be used to restart business operations or to recover the data. Used for running decision-support activities. Used for testing applications. Used for data migration.
heartbeat
Clustering uses a heartbeat mechanism to determine the health of each node in the cluster. The exchange of heartbeat signals, which usually happens over a private network, enables participating cluster members to monitor one another's status.
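A minimal sketch of heartbeat-based health checking, assuming a node is treated as failed when no heartbeat has been seen within a timeout. The `HeartbeatMonitor` class, the node names, and the 3-second timeout are illustrative; timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each cluster node."""

    def __init__(self, nodes, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {node: 0.0 for node in nodes}

    def beat(self, node, now):
        """Record a heartbeat signal received from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(["node-a", "node-b"], timeout_s=3.0)
monitor.beat("node-a", now=1.0)
monitor.beat("node-b", now=1.0)
monitor.beat("node-a", now=4.0)           # node-b has gone silent
suspects = monitor.failed_nodes(now=5.0)  # node-b missed its heartbeats
```

In a cluster, a node flagged this way would typically trigger failover of its services to a surviving node.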
Continuous Near-zero RPO
Consistency: ensures the usability of a replica. The replica must be consistent with the source.
Cumulative Backup:
It copies the data that has changed since the last full backup.
Storage Fault Tolerance Mechanisms
Data centers comprise storage systems with a large number of disk drives and solid-state drives. These storage systems support various applications and services running in the environment.
Automate Failover/Recovery
Define failover and recovery rules that route application requests to alternative resources in the event of component failures or user errors
Backup Granularity
Different granularity levels are: full backup, incremental backup, and cumulative backup.
Cloud-Based Backup: Backup as a Service
Enables consumers to procure backup services on demand through a self-service portal. Provides the capability to perform backup and recovery at any time, from anywhere. Reduces the backup management overhead. Transforms CAPEX into OPEX through pay-per-use or subscription-based pricing. Enables organizations to meet long-term retention requirements. Backing up to the cloud ensures regular and automated backup of data. Gives consumers the flexibility to select a backup technology based on their current requirements.
Restartability
Enables restarting business operations using the replicas.
Recoverability
Enables restoration of data from the replicas to the source if data loss occurs.
Erasure Coding
Provides space-optimal data redundancy to protect against data loss from multiple drive failures.
Fault Isolation
Fault isolation limits the scope of a fault to a local area so that the other areas of a system are not impacted by the fault. It does not prevent failure of a component but ensures that the failure does not impact the overall system.
Fast Recovery and Restart
For critical applications, replicas can be taken at short, regular intervals. This enables fast recovery from data loss. If a complete failure of the source LUN occurs, the replication solution enables restarting production operations on the replica. This approach reduces the RTO.
NIC Teaming
Groups NICs so that they appear as a single, logical NIC to the operating system or hypervisor. Provides network traffic failover in the event of a NIC/link failure. Distributes network traffic across NICs.
Importance of Business Continuity
HAD: High-risk data, Application dependency, Data protection laws.
IA = Calculate
IA = Uptime / (Uptime + Downtime)
Image-Based Backup
Image-based backup makes a copy of the virtual drive and configuration that are associated with a particular VM. The backup is saved as a single entity called a VM image. Enables quick restoration of a VM. Supports recovery at VM level and file level. No agent is required inside the VM to perform backup. Backup processing is offloaded from VMs to a proxy server.
Agent-Based Backup
In this approach, an agent or client is installed on a virtual machine or a physical compute system. The agent streams the backup data to the backup device.
Measurement of Information Availability
Information availability relies on the availability of both physical and virtual components of a data center.
Incremental Backup:
It copies the data that has changed since the last backup.
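The difference between full, incremental, and cumulative backup can be illustrated with a small selection function. This is a sketch under assumptions: files are represented as a hypothetical mapping of name to last-modified time, and times are plain numbers rather than real timestamps.

```python
def select_for_backup(files, kind, last_full, last_backup):
    """Return the file names a backup of the given kind would copy.

    files: mapping of file name -> last-modified time (illustrative numbers).
    last_full: time of the last full backup.
    last_backup: time of the most recent backup of any kind.
    """
    if kind == "full":
        return sorted(files)                  # everything, changed or not
    if kind == "incremental":
        cutoff = last_backup                  # changed since the last backup
    elif kind == "cumulative":
        cutoff = last_full                    # changed since the last full backup
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return sorted(name for name, mtime in files.items() if mtime > cutoff)


# Full backup ran at t=10, an incremental ran at t=20, and the next backup runs now.
files = {"a.db": 5, "b.log": 12, "c.cfg": 25}
incremental = select_for_backup(files, "incremental", last_full=10, last_backup=20)
cumulative = select_for_backup(files, "cumulative", last_full=10, last_backup=20)
full = select_for_backup(files, "full", last_full=10, last_backup=20)
```

Here the incremental backup copies only `c.cfg`, while the cumulative backup also picks up `b.log` because it changed after the last full backup.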
MTBF: How do you calculate?
MTBF = Total uptime / Number of failures
MTTR: Calculate MTTR
MTTR = Total downtime / Number of failures
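A short worked example of the MTBF, MTTR, and IA formulas. The input figures (990 hours of uptime, 10 hours of downtime, 4 failures) are made up for illustration.

```python
def availability_metrics(total_uptime_h, total_downtime_h, failures):
    """Compute MTBF, MTTR, and information availability from observed totals."""
    mtbf = total_uptime_h / failures      # MTBF = Total uptime / Number of failures
    mttr = total_downtime_h / failures    # MTTR = Total downtime / Number of failures
    ia = total_uptime_h / (total_uptime_h + total_downtime_h)  # IA = Uptime / (Uptime + Downtime)
    return mtbf, mttr, ia


mtbf, mttr, ia = availability_metrics(990.0, 10.0, failures=4)
# mtbf is 247.5 hours, mttr is 2.5 hours, and ia is 0.99 (99% availability).
# Because both totals are divided by the same failure count,
# MTBF / (MTBF + MTTR) gives the same availability figure.
```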
Compute Cluster Example
Multiple hypervisors running on different systems are clustered. Provides continuous availability of services running on VMs.
Continuous Data Protection (CDP)
Network-based replication solution. Provides the ability to restore data and VMs to any previous PIT. Supports heterogeneous compute and storage platforms. Supports both local and remote replication. Data can also be replicated to more than two sites (multisite). Supports WAN optimization techniques to reduce bandwidth requirements.
Backup Operation
ON CERT: B B B B B S S B - Drag and Drop
(1) Backup server initiates scheduled backup process.
(2) Backup server retrieves backup-related information from the backup catalog.
(3a) Backup server instructs storage node to load backup media in the backup device.
(3b) Backup server instructs backup clients to send data to be backed up to the storage node.
(4) Backup clients send data to storage node and update the backup catalog on the backup server.
(5) Storage node sends data to the backup device.
(6) Storage node sends metadata and media information to the backup server.
(7) Backup server updates the backup catalog.
Standardize Path Management:
Optimize I/O paths in physical and virtual environments (PowerPath/VE) and cloud deployments
Implementing Redundancy at Component-Level
Organizations should follow stringent guidelines to implement fault tolerance in their data centers for uninterrupted services. The underlying IT infrastructure components (compute, storage, and network) should be highly available and the single points of failure at the component level should be avoided.
Recovery Point Objectives (RPO)
Point-in-time to which data must be recovered. (How much data loss)
Definition: Business Continuity (BC)
Process that prepares for, responds to, and recovers from a system outage that can adversely affect business operations.
VMware HA
Provides high availability for applications running in virtual machines. If there is a fault in a physical compute system, the affected VMs are automatically restarted on other compute systems.
Point-in-Time (PIT) Nonzero RPO
Recoverability/Restartability: the replica can restore data to the source device, and business operations can be restarted from the replica.
Local Replication: Storage System-Based Snapshot - RoW
Redirects new writes that are destined for the source LUN to a reserved LUN in the storage pool. The replica (snapshot) still points to the source LUN. All reads from the replica are served from the source LUN.
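A toy model of redirect-on-write, assuming dictionaries of block number to data stand in for the source LUN and the reserved LUN. The `RowVolume` class and its methods are hypothetical and ignore details such as multiple snapshots.

```python
class RowVolume:
    """Toy redirect-on-write (RoW) snapshot of a single volume."""

    def __init__(self, blocks):
        self.source = dict(blocks)   # source LUN: left untouched after the snapshot
        self.reserved = {}           # reserved LUN: receives redirected new writes
        self.snapshot_taken = False

    def take_snapshot(self):
        self.snapshot_taken = True

    def write(self, block, data):
        if self.snapshot_taken:
            self.reserved[block] = data   # redirect the new write
        else:
            self.source[block] = data

    def read_production(self, block):
        # Production sees redirected data first, then the original block.
        return self.reserved.get(block, self.source.get(block))

    def read_snapshot(self, block):
        # The snapshot still points to the source LUN.
        return self.source.get(block)


volume = RowVolume({0: "A", 1: "B"})
volume.take_snapshot()
volume.write(0, "A2")
```

After the write, production reads return the new data for block 0 while the snapshot still returns the original block from the source LUN.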
Definition: Single Point of Failure
Refers to any individual component or aspect of an infrastructure whose failure can make the entire system or service unavailable.
Remote Replication
Refers to replicating data to remote locations (locations can be geographically dispersed). Data can be synchronously or asynchronously replicated. Helps to mitigate the risks associated with regional outages. Enables organizations to replicate data to the cloud for DR purposes.
Local Replication
Refers to replicating data within the same location. Within a data center in compute-based replication. Within a storage system in storage system-based replication. Typically used for operational restore of data if there is a data loss.
Information Availability can be defined in terms of:
Reliability, Timeliness
Consistency
Replica must be consistent with the source so that it is usable for both recovery and restart operations.
Testing Platform
Replicas are also used for testing new applications or upgrades.
Decision-Support Activities
Running reports using the data on the replicas greatly reduces the I/O burden on the production device.
Definition: Information Availability (IA)
The ability of an IT infrastructure to function according to business requirements and customer expectations, during its specified time of operation.
PIT replica
The data on the replica is an identical image of the production data at some specific timestamp.
Continuous replica
The data on the replica is always in sync with the production data. The objective with any continuous replication is to reduce the RPO to zero or near-zero.
Primary Storage-Based Backup
This backup approach backs up data directly from the primary storage system to the backup target without requiring additional backup software. Eliminates the backup impact on application servers. Improves the backup and recovery performance to meet SLAs.
Recovery Time Objectives (RTO)
Time within which systems and applications must be recovered. (How fast is recovery)
MTTR =
Total downtime / Number of failures
MTBF =
Total uptime / Number of failures
Fault tolerance protects an IT system or a service against the following types of unavailability:
Transient unavailability: It occurs once for a short time and then disappears. For example, an online transaction times out but works fine when a user retries the operation. Intermittent unavailability: It is a recurring unavailability that is characterized by an outage, then availability again, then another outage, and so on. Permanent unavailability: It exists until the faulty component is repaired or replaced. Examples of permanent unavailability are network link outage, application issues, and manufacturing defects.
Alternative Source for Backup
Under normal backup operations, data is read from the production LUNs and written to the backup device. This places an extra burden on the production infrastructure because production LUNs are simultaneously involved in production operations and servicing data for backup operations.
Information Availability (I/A)
Uptime / (Uptime + Downtime)
Eliminating Single Points of Failure
Single points of failure can be avoided by implementing fault tolerance mechanisms such as redundancy. Implement redundancy at the component level (compute, network, storage). Implement multiple availability zones to avoid single points of failure at the data center (site) level. It is important to have high availability mechanisms that enable automated application/service failover.
In active/active clustering
the nodes in a cluster are all active participants and run the same service for their clients. The active/active cluster balances requests for service among the nodes. If one of the nodes fails, the surviving nodes take the load of the failed one. This method enhances both the performance and the availability of a service.
In active/passive clustering,
the service runs on one or more nodes and the passive node waits for a failover. If the active node fails, the service that had been running on the active node is failed over to the passive node. Active/passive clustering does not provide performance improvement like active/active clustering.