Modules 11, 12, 13: Business Continuity (study this one; come back after)


Backup Service Deployment Options

- Managed backup service
- Remote backup service
- Replicated backup service

There are three common backup service deployment options in cloud-based backup.
• Local backup service (managed backup service): This option is suitable when a cloud service provider already provides some form of cloud services (for example, compute services or SaaS) to consumers. The service provider may choose to offer backup services to help protect the consumers' data that is hosted in the cloud. In this approach, the backup operation is completely managed by the service provider.
• Remote backup service: In this option, consumers do not perform any backup at their local site. Instead, their data is transferred over a network to a backup infrastructure managed by the cloud service provider. To perform backup to the cloud, cloud backup agent software is typically installed on the servers that need to be backed up. After installation, this software establishes a connection between the server and the cloud where the data will be stored. The backup data transferred between the server and the cloud is typically encrypted to make it unreadable to an unauthorized person or system. Deduplication can also be implemented to reduce the amount of data sent over the network (bandwidth reduction) and to reduce the cost of backup storage.
• Replicated backup service: In this option, a consumer performs backup at their local site but does not want to own, manage, or incur the expense of a remote site for disaster recovery purposes. Such consumers choose a replicated backup service, where the backup data at their site is replicated to the cloud (the remote disaster recovery site).
Note: Cloud-to-cloud backup allows consumers to back up cloud-hosted application data (SaaS applications) to another cloud.

factors affecting deduplication ratio

- Retention period
- Frequency of full backup
- Change rate
- Data type
- Deduplication method

• Retention period: The period of time that defines how long the backup copies are retained. The longer the retention, the greater the chance that identical data exists in the backup set, which increases the deduplication ratio and storage space savings.
• Frequency of full backup: As more full backups are performed, the same data is repeatedly backed up, resulting in a high deduplication ratio.
• Change rate: The rate at which the data received from the backup application changes from backup to backup. Client data with few changes between backups produces higher deduplication ratios.
• Data type: Backups of user data such as text documents, PowerPoint presentations, spreadsheets, and e-mails are known to contain redundant data and are good deduplication candidates. Other data such as audio, video, and scanned images are highly unique and typically do not yield good deduplication ratios.
• Deduplication method: The deduplication method also determines the effective deduplication ratio. Variable-length, sub-file deduplication (discussed later) discovers the highest amount of duplicate data.

RTO - Recovery Time Objective

- Time within which systems and applications must be recovered after an outage
- Amount of downtime that a business can endure and survive

Recovery Time Objective (RTO): This is the time within which systems and applications must be recovered after an outage. It defines the amount of downtime that a business can endure and survive. For example, if the RTO is a few seconds, then implementing global clustering would help achieve the required RTO. The more critical the application, the lower the RTO should be. Both RPO and RTO are counted in minutes, hours, or days and are directly related to the criticality of the IT service and data. The lower the RTO and RPO, the higher the cost of a BC solution.

BC Planning Lifecycle

1. Establishing objectives
2. Analyzing
3. Designing and developing
4. Implementing
5. Training, testing, assessing, and maintaining

BC planning must follow a disciplined approach like any other planning process. Organizations today dedicate specialized resources to develop and maintain BC plans. From the conceptualization to the realization of the BC plan, a lifecycle of activities can be defined for the BC process. The BC planning lifecycle includes five stages:
1. Establishing objectives
• Determine BC requirements
• Estimate the scope and budget to achieve requirements
• Select a BC team that includes subject matter experts from all areas of business, whether internal or external
• Create BC policies
2. Analyzing
• Collect information on data profiles, business processes, infrastructure support, dependencies, and frequency of using business infrastructure
• Conduct a business impact analysis
• Identify critical business processes and assign recovery priorities
• Perform risk analysis for critical functions and create mitigation strategies
• Perform cost-benefit analysis for available solutions based on the mitigation strategy
• Evaluate options
3. Designing and developing
• Define the team structure and assign individual roles and responsibilities; for example, different teams are formed for activities such as emergency response and infrastructure and application recovery
• Design data protection strategies and develop infrastructure
• Develop contingency solutions and emergency response procedures
• Detail recovery and restart procedures
4. Implementing
• Implement risk management and mitigation procedures that include backup, replication, and management of resources
• Prepare the DR sites that can be utilized if a disaster affects the primary data center. The DR site could be one of the organization's own data centers or could be a cloud
• Implement redundancy for every resource in a data center to avoid single points of failure
5. Training, testing, assessing, and maintaining
• Train the employees who are responsible for backup and replication of business-critical data on a regular basis or whenever there is a modification in the BC plan
• Train employees on emergency response procedures when disasters are declared
• Train the recovery team on recovery procedures based on contingency scenarios
• Perform damage-assessment processes and review recovery plans
• Test the BC plan regularly to evaluate its performance and identify its limitations
• Assess the performance reports and identify limitations
• Update the BC plan

vSphere Data Protection

A backup and recovery solution designed for vSphere environments. It provides agent-less, image-level virtual machine backups to disk. It also provides application-aware protection for business-critical Microsoft applications (Exchange, SQL Server, SharePoint) along with WAN-efficient, encrypted backup data replication. It is fully integrated with vCenter Server and vSphere Web Client.

Disaster Recovery

A part of the BC process that involves a set of policies and procedures for restoring IT infrastructure, including the data required to support ongoing IT services, after a natural or human-induced disaster occurs.

Disaster recovery (DR) is a part of the BC process which involves a set of policies and procedures for restoring IT infrastructure, including the data required to support ongoing IT services, after a natural or human-induced disaster occurs. Disaster recovery plans (DRP) are generally part of a larger, more extensive practice known as business continuity planning. DR plans should be well practiced so that the key people become familiar with the specific actions they will need to take when a disaster occurs. DR plans must also be adaptable and routinely updated; for example, if new people, a new branch office, or new hardware or software are added to an organization, they should promptly be incorporated into the organization's disaster recovery plan. Companies must consider all these facets of their organization, and update and practice their plan, if they want to maximize their recovery after a disaster.

The basic underlying concept of DR is to have a secondary data center or site (DR site) at a pre-planned level of operational readiness when an outage happens at the primary data center. Typically in a DR process, a previous copy of the data is restored and logs are applied to that copy to bring it to a known point of consistency. After all recovery efforts are completed, the data is validated to ensure that it is correct.

Disaster recovery methods often require buying and maintaining a complete set of IT resources at a secondary data center that matches the business-critical systems at the primary data center. This includes sufficient storage to house a complete copy of all of the enterprise's business data by regularly copying production data to the mirror systems at the secondary site. This may be a complex process and an expensive solution for a significant number of organizations. Disaster Recovery-as-a-Service (DRaaS) has emerged as a solution to strengthen the portfolio of a cloud service provider, while offering a viable DR solution to consumer organizations. Having DR sites in the cloud reduces the need for data center space, IT infrastructure, and IT resources, which leads to significant cost reductions for organizations. DRaaS is further discussed in module 14, 'Replication'.

remote backup service

A service that provides users with an online system for backing up and storing computer files. Remote backup has several advantages over traditional backup methodologies: the task of creating and maintaining backup files is removed from the IT department's responsibilities; the backups are maintained off site; some services can operate continuously, backing up each transaction as it occurs.

recovery-in-place

A term that refers to running a VM directly from the backup device, using a backed-up copy of the VM image instead of restoring that image file.
- Eliminates the need to transfer the image from the backup device to primary storage before the VM is restarted
- Provides an almost instant recovery of a failed VM
- Requires a random-access device (a disk-based backup target) in order to work efficiently
- Reduces the RTO and the network bandwidth needed to restore VM files

Recovery-in-place (instant VM recovery) refers to running a VM directly from the backup device, using a backed-up copy of the VM image instead of restoring that image file. One of the primary benefits of recovery-in-place is that it eliminates the need to transfer the image from the backup area to the primary storage area before it is restarted, so the applications running on those VMs can be accessed more quickly. This not only saves time for recovery, but also reduces the network bandwidth needed to restore files. When a VM is recovered in place, it is dependent on the storage I/O performance of the actual disk target (disk backup appliance).

graceful degradation

An application maintains limited functionality even when some of its modules or supporting services are not available. The unavailability of certain application components or modules should not bring down the entire application.
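Below is a minimal Python sketch of the idea, using a hypothetical recommendation service and product page: if the supporting service is unreachable, the page still renders with a static fallback instead of failing entirely.

```python
# Minimal sketch of graceful degradation (service name and URL are hypothetical):
# if the recommendation service is down, the product page still renders with a
# static fallback instead of bringing down the whole application.
import urllib.error
import urllib.request

def get_recommendations(product_id: str) -> list[str]:
    """Query a hypothetical recommendation service; degrade on failure."""
    url = f"http://recs.example.internal/products/{product_id}/related"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.read().decode().splitlines()
    except (urllib.error.URLError, TimeoutError):
        # Supporting service unavailable: return a limited, static result
        # so the rest of the application keeps working.
        return ["bestseller-1", "bestseller-2"]

def render_product_page(product_id: str) -> str:
    recs = get_recommendations(product_id)
    return f"Product {product_id}\nRelated: {', '.join(recs)}"
```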

Data Archiving Solution Architecture

Three key components: archiving agent, archiving server (policy engine), and archiving storage device.

An archiving agent is software installed on the application servers (for example, file servers and e-mail servers). The agent is responsible for scanning the data and archiving it, based on the policy defined on the archiving server (policy engine). After the data is identified for archiving, it is moved to the archiving storage device. From a client perspective, this movement is completely transparent. The original data on primary storage is then replaced with a stub file. The stub file contains the address of the archived data; the size of this file is small, which significantly saves space on primary storage. When a client tries to access a file from the application server, the stub file is used to retrieve the file from the archive storage device (see the sketch below).

An archiving server is software installed on a server that enables administrators to configure the policies for archiving data. Policies can be defined based on file size, file type, or creation/modification/access time. Once the data is identified for archiving, the archiving server creates an index for the data to be moved. By utilizing the index, users can also search and retrieve their data with a web search tool.

Note: Converged backup and archive storage. Backup is driven by the need for recoverability and disaster protection, while archiving is driven by the need for improved efficiency and to address compliance challenges. Real cost savings can be realized by adopting a common strategy for the physical storage of both backup and archiving workloads. To accomplish this, a common storage target must be able to handle the throughput and inline deduplication requirements of backup workloads and the secure, long-term retention requirements of archive workloads. In addition, the storage target should provide built-in capabilities for network-efficient replication for disaster recovery needs, enterprise features such as encryption, and easy integration with existing application infrastructure. By leveraging a common infrastructure for both, organizations can greatly ease the burden of eDiscovery, data recovery, business continuity, and compliance, and achieve these goals in the most cost-efficient manner.
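Here is the stub-file retrieval path as a minimal Python sketch; the paths, the JSON stub format, and the helper names are assumptions for illustration, not how any particular archiving product implements it.

```python
# Minimal sketch of the stub-file mechanism: the archiving agent replaces the
# original file with a small stub holding the address of the archived copy;
# on access, the stub is followed to fetch the content from archive storage.
import json
import os
import shutil

ARCHIVE_ROOT = "/mnt/archive"        # assumed archive storage mount
STUB_SUFFIX = ".stub"

def archive_file(path: str, archive_id: str) -> None:
    """Move a file to archive storage and leave a small stub behind."""
    dest = os.path.join(ARCHIVE_ROOT, archive_id)
    shutil.move(path, dest)
    with open(path + STUB_SUFFIX, "w") as stub:
        json.dump({"archive_address": dest}, stub)   # stub is only a few bytes

def read_file(path: str) -> bytes:
    """Transparent read: follow the stub if the file was archived."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    with open(path + STUB_SUFFIX) as stub:
        dest = json.load(stub)["archive_address"]
    with open(dest, "rb") as f:                       # fetch from archive tier
        return f.read()
```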

Direct primary storage backup approach

Backs up data directly from primary storage to a backup target without backup software.
- Eliminates the backup impact on application servers
- Improves backup and recovery performance to meet SLAs

This backup approach backs up data directly from the primary storage system to a backup target without requiring additional backup software. Typically, an agent runs on the application servers to control the backup process. This agent stores configuration data for mapping the LUNs on the primary storage system to the backup device in order to orchestrate backup (the transfer of changed blocks and the creation of backup images) and recovery operations. This backup information (metadata) is stored in a catalog which is local to the application server. When a backup is triggered through the agent running on the application server, the application momentarily pauses simply to mark the point in time for that backup. The data blocks that have changed since the last backup are sent across the network to the backup device. The direct movement from primary storage to the backup device eliminates the LAN impact by isolating all backup traffic to the SAN. This approach eliminates the backup impact on application servers and provides faster backup and recovery to meet application protection SLAs. For data recovery, the backup administrator triggers the recovery operation; the primary storage then reads the backup image from the backup device and replaces the production LUN with the recovered copy.

ProtectPoint (EMC)

Backs up directly from primary storage (VMAX). Eliminates the backup impact on the application server. Leverages primary storage changed block tracking technology.

NDMP 2-Way (Direct Method)

Backup server instructs the NAS head to start the backup. Data is backed up from storage and sent directly to backup device. Network traffic is minimized.

Sub-file level deduplication

Breaks files into smaller segments. Detects redundant data within and across files. Two methods: fixed-length block and variable-length block.

EMC Networker

Centralizes, automates, and accelerates data backup and recovery operations across the enterprise. Supports multiplexing. Supports source-based and target-based deduplication capabilities by integrating with EMC Avamar and EMC Data Domain, respectively.

Cumulative (differential) backup

Cumulative (differential) backup: It copies the data that has changed since the last full backup. Suppose, for example, the administrator creates a full backup on Monday and differential backups for the rest of the week. Tuesday's backup would contain all of the data that has changed since Monday; it would therefore be identical to an incremental backup at this point. On Wednesday, however, the differential backup would back up any data that had changed since Monday (the full backup). The advantage that differential backups have over incrementals is shorter restore times: restoring a differential backup never requires more than two copies. The tradeoff is that, as time progresses, a differential backup can grow to contain much more data than an incremental backup.
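The restore-chain difference can be made concrete with a small sketch (the backup catalog is just a list of dictionaries here, purely for illustration): a differential restore needs at most the last full plus the most recent differential, while an incremental restore needs the last full plus every incremental taken after it.

```python
# Hedged sketch comparing restore chains. With differentials, a restore never
# needs more than two copies (last full + latest differential); with
# incrementals, every copy since the last full must be applied in order.
def restore_chain_differential(backups: list[dict]) -> list[dict]:
    last_full = max(i for i, b in enumerate(backups) if b["type"] == "full")
    chain = [backups[last_full]]
    diffs = [b for b in backups[last_full + 1:] if b["type"] == "differential"]
    if diffs:
        chain.append(diffs[-1])          # only the most recent differential
    return chain

def restore_chain_incremental(backups: list[dict]) -> list[dict]:
    last_full = max(i for i, b in enumerate(backups) if b["type"] == "full")
    # full backup plus every incremental taken after it, in order
    return [backups[last_full]] + [
        b for b in backups[last_full + 1:] if b["type"] == "incremental"
    ]

week = [{"day": "Mon", "type": "full"},
        {"day": "Tue", "type": "differential"},
        {"day": "Wed", "type": "differential"}]
print([b["day"] for b in restore_chain_differential(week)])  # ['Mon', 'Wed']
```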

Key features of CAS

Data integrity, content authenticity, single-instance storage, retention enforcement, scalability, location independence, data protection, performance, self-healing, and audit trails.

The key features of CAS are as follows:
• Content integrity: Provides assurance that the stored content has not been altered. If the fixed content is altered, CAS generates a new address for the altered content, rather than overwriting the original fixed content.
• Content authenticity: Assures the genuineness of stored content. This is achieved by generating a unique content address for each object and validating the content address for stored objects at regular intervals. Content authenticity is assured because the address assigned to each object is as unique as a fingerprint. Every time an object is read, CAS uses a hashing algorithm to recalculate the object's content address as a validation step and compares the result to its original content address. If the object validation fails, CAS rebuilds the object using its protection scheme.
• Single-instance storage: CAS uses a unique content address to guarantee the storage of only a single instance of an object. When a new object is written, the CAS system is polled to see whether an object is already available with the same content address. If the object is available in the system, it is not stored; instead, only a pointer to that object is created.
• Retention enforcement: Protecting and retaining objects is a core requirement of an archive storage system. After an object is stored in the CAS system and the retention policy is defined, CAS does not make the object available for deletion until the policy expires.
• Scalability: CAS allows the addition of more nodes to the cluster to scale without any interruption to data access.
• Location independence: CAS uses a unique content address, rather than directory path names or URLs, to retrieve data. This makes the physical location of the stored data irrelevant to the application that requests the data.
• Data protection: CAS ensures that the content stored on the CAS system is available even if a disk or a node fails. CAS provides both local and remote protection to the data objects stored on it. In the local protection option, data objects are either mirrored or parity protected. In mirror protection, two copies of the data object are stored on two different nodes in the same cluster; this decreases the total available capacity by 50 percent. In parity protection, the data object is split into multiple parts and parity is generated from them. Each part of the data and its parity are stored on a different node. This method consumes less capacity to protect the stored data, but takes slightly longer to regenerate the data if corruption occurs. In the remote replication option, data objects are copied to a secondary CAS at a remote location. In this case, the objects remain accessible from the secondary CAS if the primary CAS system fails.
• Performance: CAS stores all objects on disks, which provides faster access to the objects compared to tapes and optical discs.
• Self-healing: CAS automatically detects and repairs corrupted objects and alerts the administrator about the potential problem. CAS systems can be configured to alert remote support teams who can diagnose and repair the system remotely.
• Audit trails: CAS keeps track of management activities and any access or disposition of data. Audit trails are mandated by compliance requirements.
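A toy sketch of how content addressing, single-instance storage, and read-time validation could fit together; SHA-256 and an in-memory dictionary stand in for whatever hashing scheme and storage a real CAS product uses.

```python
# Illustrative sketch of CAS-style behavior, not any vendor's implementation.
import hashlib

class ContentAddressedStore:
    def __init__(self):
        self._objects: dict[str, bytes] = {}

    def write(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()   # content address
        if address not in self._objects:             # single-instance storage
            self._objects[address] = data
        return address

    def read(self, address: str) -> bytes:
        data = self._objects[address]
        # Content authenticity: recompute the address and compare on every read.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("object validation failed; rebuild via protection scheme")
        return data

store = ContentAddressedStore()
addr = store.write(b"fixed content, e.g. an X-ray image")
assert store.write(b"fixed content, e.g. an X-ray image") == addr  # stored only once
```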

Target-based deduplication

Data is deduplicated at the target, either inline or post-process.
- Offloads the backup client from the deduplication process
- Requires sufficient network bandwidth
- In some implementations, part of the deduplication load is moved to the backup server, which reduces the burden on the target and improves the overall backup performance

Target-based data deduplication occurs at the backup device or backup appliance, which offloads the backup client from the deduplication process. In this case, the backup client sends the data to the backup device and the data is deduplicated at the backup target, either immediately (inline) or at a scheduled time (post-process). Inline deduplication performs deduplication on the backup data before it is stored on the backup device. With inline data deduplication, the incoming backup stream is divided into smaller chunks, which are then compared to data that has already been deduplicated. The inline deduplication method requires less storage space than the post-process approach; however, inline deduplication may slow down the overall backup process. Some vendors' inline deduplication systems leverage the continued advancement of CPU technology to increase the performance of inline deduplication by minimizing the disk accesses required to deduplicate data. Such systems identify duplicate data segments in memory, which minimizes disk usage. In post-process deduplication, the backup data is first stored to disk in its native backup format and deduplicated after the backup is complete. In this approach, the deduplication process is separated from the backup process and the deduplication happens outside the backup window. However, the full backup data set is transmitted across the network to the storage target before the redundancies are eliminated, so this approach requires adequate storage capacity and network bandwidth to accommodate the full backup data set. Organizations can consider implementing target-based deduplication when their backup application does not have built-in deduplication capabilities. In some implementations, part of the deduplication functionality is moved to the backup host or backup server. This reduces the burden on the target backup device and improves the overall backup performance.

file-level deduplication

Detects and removes redundant copies of identical files. Stores only one copy; subsequent copies are replaced with a pointer to the original file.
*Does not address the problem of duplicate content inside the files

EMC Avamar

Disk-based backup and recovery solution that provides source-based data deduplication. Three components: Avamar server, Avamar backup client, and Avamar administrator. Avamar provides a variety of options for backup, including guest OS-level backup and image-level backup.

Data deduplication in primary storage

Eliminates redundant data blocks in primary storage. All incoming data writes are chunked into blocks.
- Each block is fingerprinted (hash value) based on the data content
- Each fingerprinted block is compared to the existing blocks before it is written to the storage system
  * If the block already exists, the data block is not written to disk
  * Otherwise, the unique data block is written to disk
- Reduces the primary storage requirement and TCO
- Improves the effective utilization of storage

Today, organizations usually deploy primary storage systems for their production environment in order to meet the required service levels. These storage resources are very expensive, so it is important for organizations to effectively utilize and manage them. Typically a lot of duplicate data is found in the production environment, which unnecessarily consumes more storage resources and leads to a higher total cost of ownership (TCO). To avoid this situation, most primary storage systems (block-based storage and NAS) now support deduplication technology to eliminate duplicate data. This description focuses on block-based storage systems. The block-based storage system processes the data in blocks as it enters the storage controller. All incoming data writes are chunked into blocks, and each block is fingerprinted (hash value) based on the data content within the block. The fingerprinting methodology provides a uniform distribution of values: even a single bit of difference between any two blocks results in completely uncorrelated fingerprint values. An advantage of fingerprinting each block is that before a block is written, its fingerprint is compared to existing blocks in the storage system. If the block already exists in the system, the data is not written to disk. By eliminating redundant data on primary storage, the organization can save on storage cost. Note that running deduplication consumes resources in the primary storage and may impact the performance of the storage system.
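An illustrative sketch of the write path described above, assuming a fixed 4 KB block size and SHA-256 fingerprints (simplifications for the sketch, not any vendor's implementation):

```python
# Incoming writes are chunked into blocks, each block is fingerprinted, and
# only blocks whose fingerprint is not already present are written to disk.
import hashlib

BLOCK_SIZE = 4096
block_store: dict[str, bytes] = {}    # fingerprint -> block (stands in for disk)
lun_map: list[str] = []               # logical block number -> fingerprint

def write_stream(data: bytes) -> None:
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()     # fingerprint by content
        if fp not in block_store:                  # unique block: write it
            block_store[fp] = block
        lun_map.append(fp)                         # duplicates only add a pointer
```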

NDMP 3-way (Remote Method)

Enables the NAS head to control the backup device and share it with other NAS heads. Useful when there are limited backup devices in the environment.

full backup

Full backup: As the name implies, it is a full copy of the entire data set. Organizations typically perform full backups only periodically because they require more storage space and take more time. A full backup provides faster data recovery.

EMC SourceOne

Helps organizations archive aging email, files, and Microsoft SharePoint content to the appropriate storage tiers. The EMC SourceOne family of products includes: EMC SourceOne Email Management for archiving e-mail messages and other items; EMC SourceOne for Microsoft SharePoint for archiving SharePoint site content; EMC SourceOne for File Systems for archiving files from file servers; EMC SourceOne Discovery Manager for discovering, collecting, preserving, reviewing, and exporting relevant content; and EMC SourceOne Supervisor for monitoring corporate policy compliance.

incremental backup

Incremental backup: It copies the data that has changed since the last backup. For example, a full backup is created on Monday, and incremental backups are created for the rest of the week. Tuesday's backup would only contain the data that has changed since Monday. Wednesday's backup would only contain the data that has changed since Tuesday. The primary disadvantage to incremental backups is that they can be time-consuming to restore. Suppose an administrator wants to restore the backup from Wednesday. To do so, the administrator has to first restore Monday's full backup. After that, the administrator has to restore Tuesday's copy, followed by Wednesday's.

Incremental forever backup

Incremental forever backup: Rather than scheduling periodic full backups, this backup solution requires only one initial full backup. Afterwards, an ongoing (forever) sequence of incremental backups occurs. The real difference, however, is that the incremental backups are automatically combined with the original in such a way that you never need to perform a full backup again. This method reduces the amount of data that goes across the network and reduces the length of the backup window.

Drivers for Data Deduplication

Limited budget and backup window. Bandwidth constraints and longer retention periods. With the growth of data and 24x7 service availability requirements, organizations face challenges in protecting their data. Typically, a lot of redundant data is backed up, which significantly increases the backup window size and also results in unnecessary consumption of resources such as backup storage space and network bandwidth. There are also requirements to preserve data for longer periods, whether driven by the needs of consumers or by legal and regulatory concerns. Backing up large amounts of duplicate data at a remote site for DR purposes is also very cumbersome and requires a lot of bandwidth. Data deduplication provides the solution for organizations to overcome these challenges in a backup environment.

Measuring Information availability

MTBF, MTTR, and IA

IA

IA = MTBF / (MTBF + MTTR), or IA = uptime / (uptime + downtime)
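A worked example with made-up numbers, showing that both forms of the formula give the same result:

```python
# Worked example of the availability formulas above, using made-up numbers:
# a service that was up 2,000 hours, suffered 4 failures, and needed a total
# of 8 hours of repair time.
uptime_hours = 2000
downtime_hours = 8
failures = 4

mtbf = uptime_hours / failures              # mean time between failures = 500 h
mttr = downtime_hours / failures            # mean time to repair = 2 h
ia = mtbf / (mtbf + mttr)                   # same as uptime / (uptime + downtime)
print(f"IA = {ia:.4%}")                     # roughly 99.60% availability
```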

MTBF

Mean Time Between Failures

key components of NDMP

NDMP client
- An NDMP-enabled backup software installed as add-on software on the backup server
- Instructs the NAS head to start the backup
NDMP server
- The NAS head acts as an NDMP server, which performs the backup and sends the data to the backup device
  * The NAS head uses its data server to read the data from the storage
  * The NAS head then uses its media server to send the data read by the data server to the backup device
- Only backup metadata is transferred over the production LAN

The key components of an NDMP infrastructure are the NDMP client and the NDMP server. The NDMP client is the NDMP-enabled backup software installed as add-on software on the backup server. The NDMP server has two components: a data server and a media server. The data server is a component on a NAS system that has access to the file systems containing the data to be backed up. The media server is a component on a NAS system that has access to the backup device. The backup operation occurs as follows:
1. The backup server uses the NDMP client to instruct the NAS head to start the backup
2. The NAS head uses its data server to read the data from the storage
3. The NAS head then uses its media server to send the data read by the data server to the backup device
In this backup operation, NDMP uses the production network only to transfer metadata. The actual backup data is either transferred directly to the backup device (NDMP 2-way) or through a private backup network (NDMP 3-way) by the NAS head.
NDMP 2-way (direct NDMP method): In this method, the backup server uses NDMP over the LAN to instruct the NAS head to start the backup. The data to be backed up is sent from the storage directly to the backup device. In this model, network traffic is minimized on the production network by isolating backup data movement from the NAS head to a locally attached backup device. During the backup, metadata is transferred via NDMP over the LAN to the backup server. During a restore operation, the backup server uses NDMP over the LAN to instruct the NAS to start restoring files; data is restored from the locally attached backup device.
NDMP 3-way (remote NDMP method): In this method, the backup server uses NDMP over the LAN to instruct NAS head (A) to start backing up data to the backup device attached to NAS head (B). These NAS devices can be connected over a private backup network to reduce the impact on the production LAN. During the backup, the metadata is sent via NDMP by NAS head (A) to the backup server over the production LAN. In this method, NAS head (A) performs the role of data server and the other NAS head performs the role of media server. NDMP 3-way is useful when there are limited backup devices in the environment. It enables a NAS head to control the backup device and share it with other NAS heads by receiving backup data through NDMP.

NDMP

Network Data Management Protocol

EMC Mozy

SaaS solution for secure, cloud-based online backup and recovery. Provides automatic and scheduled backups. Supports mobile-based backup.

Global deduplication

A single hash index is shared across multiple appliances (nodes).
- Ensures the data is backed up only once across the backup environment
- Deduplication is more effective and provides a better deduplication ratio
- Creates smaller storage footprints and reduces storage costs
- Best suited for environments with large amounts of backup data across multiple locations

In global data deduplication, a single hash index is shared among the appliances (nodes) to ensure that the data is backed up only once across the backup environment. Global data deduplication provides more effective data deduplication and increases the deduplication ratio. Users with large amounts of backup data across multiple locations benefit most from this approach. Global deduplication provides the following benefits:
• Creates smaller storage footprints and reduces storage costs
• Decreases the network bandwidth requirements for data replication
• Eliminates data silos in a backup environment
• Simplifies and centralizes the management of deduplication appliances

Synthetic backup

Synthetic backup: Another way to implement full backup is synthetic backup. This method is used when the production volume resources cannot be exclusively reserved for a backup process for extended periods to perform a full backup. A synthetic backup takes data from an existing full backup and merges it with the data from any existing incrementals and cumulatives. This effectively results in a new full backup of the data. This backup is called synthetic because the backup is not created directly from production data. A synthetic full backup enables a full backup copy to be created offline without disrupting the I/O operation on the production volume. This also frees up network resources from the backup process, making them available for other production uses.
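A hedged sketch of the idea: the synthetic full is assembled offline from the existing full backup plus the incrementals that follow it, with files represented as dictionary entries purely for illustration.

```python
# The latest version of every file is taken from the last full backup plus the
# incrementals after it, without touching the production volume.
def synthesize_full(full: dict[str, bytes],
                    incrementals: list[dict[str, bytes]]) -> dict[str, bytes]:
    synthetic = dict(full)                 # start from the last full backup
    for inc in incrementals:               # apply incrementals in order
        synthetic.update(inc)              # newer versions overwrite older ones
    return synthetic

full_mon = {"a.txt": b"v1", "b.txt": b"v1"}
inc_tue = {"a.txt": b"v2"}
inc_wed = {"c.txt": b"v1"}
new_full = synthesize_full(full_mon, [inc_tue, inc_wed])  # acts as a new full backup
```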

EMC Data Domain

Target-based deduplication solution. Data Domain Boost software increases backup performance by distributing parts of the deduplication process to the backup server. Provides secure multi-tenancy. Supports backup and archive in a single system.

RPO (Recovery Point Objective)

The point in time to which systems and data must be recovered after an outage. The amount of data loss that a business can endure. Recovery Point Objective (RPO): This is the point in time to which systems and data must be recovered after an outage. It defines the amount of data loss that a business can endure. Based on the RPO, organizations plan the frequency with which a backup or replica must be made. An organization can plan for an appropriate BC technology solution on the basis of the RPO it sets. For example, if the RPO of a particular business application is 24 hours, then backups are created every midnight. The corresponding recovery strategy is to restore data from the last backup set.

data deduplication

The process of detecting and identifying the unique data segments within a given set of data to eliminate redundancy.
Deduplication process:
- Chunk the data set
- Identify duplicate chunks
- Eliminate the redundant chunks
Deduplication can be performed in the backup environment as well as in the production environment. The effectiveness of deduplication is expressed as a deduplication ratio.

Deduplication is the process of detecting and identifying the unique data segments (chunks) within a given set of data to eliminate redundancy. The use of deduplication techniques significantly reduces the amount of data to be backed up. Data deduplication operates by segmenting a data set into blocks, identifying redundant data, and writing the unique blocks to a backup target. To identify redundant blocks, the data deduplication system creates a hash value or digital signature, like a fingerprint, for each data block and an index of the signatures for a given repository. The index provides the reference list to determine whether blocks already exist in the repository. When the data deduplication system sees a block it has processed before, instead of storing the block again, it inserts a pointer to the original block in the repository. It is important to note that data deduplication can be performed in the backup as well as the production environment. In the production environment, deduplication is implemented at primary storage systems to eliminate redundant data in the production volume. The effectiveness of data deduplication is expressed as a deduplication ratio, denoting the ratio of the amount of data before deduplication to the amount of data after deduplication. This ratio is typically depicted as "ratio:1" or "ratio X" (10:1 or 10X). For example, if 200 GB of data consumes 20 GB of storage capacity after data deduplication, the space reduction ratio is 10:1. Every data deduplication vendor claims that their product offers a certain ratio of data reduction; however, the actual deduplication ratio varies based on many factors. These factors are discussed next.
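A small illustration of computing the deduplication ratio by chunking a data set and counting bytes before and after duplicates are eliminated; the fixed 8 KiB chunk size and SHA-256 digest are assumptions for the sketch.

```python
# Chunk the data, store each unique chunk once, and report "ratio:1".
import hashlib

def dedupe_ratio(data: bytes, chunk_size: int = 8192) -> float:
    seen: set[str] = set()
    stored = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen:             # unique chunk: it must be stored
            seen.add(digest)
            stored += len(chunk)
    return len(data) / stored              # e.g. 200 GB / 20 GB -> 10.0 (10:1)

sample = b"A" * 8192 * 9 + b"B" * 8192     # nine identical chunks + one unique
print(f"{dedupe_ratio(sample):.0f}:1")     # prints "5:1"
```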

Data Archiving

The process of identifying and moving inactive data out of current production systems and into specialized long-term archival storage systems.
- A data archive is a repository where fixed content is stored
- Organizations set their own policies for qualifying data to archive
Archiving enables organizations to:
- Reduce ongoing primary storage acquisition costs
- Meet regulatory compliance
- Reduce backup challenges, including the backup window, by moving static data out of the recurring backup stream
- Make use of this data for generating new revenue strategies

In the information lifecycle, data is actively created, accessed, and changed. As data ages, it is less likely to be changed and eventually becomes "fixed" but continues to be accessed by applications and users. This data is called fixed content. Assets such as X-rays, MRIs, CAD/CAM designs, surveillance video, MP3s, and financial documents are just a few examples of fixed data that is growing at over 90% annually. Data archiving is the process of moving data (fixed content) that is no longer actively accessed to a separate low-cost archival storage tier for long-term retention and future reference. A data archive is a storage repository used to store this data. Organizations set their own policies for qualifying data to be moved into archives; these policy settings are used to automate the process of identifying and moving the appropriate data into the archive system. Organizations implement archiving processes and technologies to reduce primary storage cost. With archiving, capacity on expensive primary storage can be reclaimed by moving infrequently accessed data to a lower-cost archive tier. Archiving fixed content before taking a backup helps to reduce the backup window and backup storage acquisition costs. Government regulations and legal/contractual obligations mandate organizations to retain their data for an extended period of time. The key to determining how long to retain an organization's archives is to understand which regulations apply to the particular industry and which retention rules apply to that regulation. For instance, all publicly traded companies are subject to the Sarbanes-Oxley (SOX) Act, which defines e-mail retention requirements, among other things related to data storage and security. Archiving helps organizations adhere to these compliance requirements. Archiving can also help organizations use growing volumes of information in potentially new and unanticipated ways. For example, new product innovation can be fostered if engineers can access archived project materials such as designs, test results, and requirement documents. In addition to meeting governance and compliance requirements, organizations retain data for business intelligence and competitive advantage. Both active and archived information can help data scientists drive new innovations or help improve current business processes.

Changed block tracking for restoring

This technique reduces recovery time (RTO) compared to full image restores by restoring only the delta of changed VM blocks. It determines which blocks have changed since the last backup and restores only those changed VM blocks.

Changed block tracking for backup: To increase the efficiency of image-based backup, some vendors support incremental backup through tracking changed blocks. This feature identifies and tags any blocks that have changed since the last VM snapshot, which enables the backup application to back up only the blocks that have changed, rather than backing up every block. The changed block tracking technique dramatically reduces the amount of data copied before additional data reduction technologies are applied, reduces the backup window, and reduces the amount of storage required for protecting VMs.

Changed block tracking for restoring: This technique reduces recovery time (RTO) compared to full image restores by restoring only the delta of changed VM blocks. During a restore process, it is determined which blocks have changed since the last backup. For example, if a large database is corrupted, a changed block recovery would restore just the parts of the database that have changed since the last backup was made.
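A conceptual sketch of changed-block restore, with a simple boolean bitmap standing in for the changed-block tracking data:

```python
# Only the blocks flagged as changed since the last backup are copied back,
# instead of restoring the full image. Block layout is simplified.
def restore_changed_blocks(production: list[bytes],
                           backup: list[bytes],
                           changed_bitmap: list[bool]) -> int:
    restored = 0
    for i, changed in enumerate(changed_bitmap):
        if changed:                        # delta only: skip unchanged blocks
            production[i] = backup[i]
            restored += 1
    return restored                        # far fewer blocks than a full restore
```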

MTTR (mean time to repair)

MTTR = total downtime / number of failures

managed backup service

Suitable when a cloud service provider already hosts consumer applications and data. The backup service is offered by the provider to protect the consumer's data, and the backup is managed by the service provider.

scalability

allows the addition of more nodes to the cluster to scale without any interruption

backup

an additional copy of production data, created and retained for the sole purpose of recovering lost or corrupted data

NDMP define

An open, industry-standard TCP/IP-based protocol specifically designed for backup in a NAS environment.
- Data can be backed up using NDMP regardless of the OS or platform
- Backup data is sent directly from the NAS to the backup device; it is no longer necessary to transport data through application servers
- Backs up and restores data while preserving the security attributes of the file systems (NFS and CIFS) and maintains data integrity

As the amount of unstructured data continues to grow exponentially, organizations face the daunting task of ensuring that critical data on NAS systems is protected. Most NAS heads run proprietary operating systems designed for serving files; to maintain operational efficiency, they generally do not support hosting third-party applications such as backup clients. This forced backup administrators to back up data from the application server or to mount each NAS volume via CIFS or NFS from another server across the network, which hosted a backup agent. These approaches may lead to performance degradation of the application server and the production network during backup operations, due to the overhead. Further, security structures differ between the two network file systems, NFS and CIFS. Backups implemented via one of the file systems would not effectively back up the data security attributes on the NAS head that were accessed via the other file system. For example, a CIFS backup, when restored, would not be able to restore NFS file attributes, and vice versa. These backup challenges of the NAS environment can be addressed with the Network Data Management Protocol (NDMP). NDMP is an industry-standard TCP/IP-based protocol specifically designed for backup in a NAS environment. It communicates with several elements in the backup environment (NAS head, backup devices, backup server, and so on) for data transfer and enables vendors to use a common protocol for the backup architecture. Data can be backed up using NDMP regardless of the operating system or platform. NDMP backs up and restores data without losing data integrity or file system structure (with respect to the different rights and permissions in different file systems). Due to its flexibility, it is no longer necessary to transport data through the application server, which reduces the load on the application server and improves backup speed. NDMP optimizes backup and restore by leveraging the high-speed connection between the backup devices and the NAS head. In NDMP, backup data is sent directly from the NAS head to the backup device, whereas metadata is sent to the backup server.

Persistent Application State Model

Application state information is stored outside the instance's memory, in a data repository. If an instance fails, the state information is still available in the repository.
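A minimal sketch of the pattern: a shopping-cart service writes every update to an external repository (a plain dictionary stands in here for a real database or cache service), so a replacement instance can continue from where a failed instance stopped.

```python
# State lives outside the application instance, so instance failure does not
# lose it. The dict is purely illustrative of an external data repository.
state_repository: dict[str, list[str]] = {}      # survives instance failures

class CartService:
    def add_item(self, session_id: str, item: str) -> None:
        cart = state_repository.setdefault(session_id, [])
        cart.append(item)                         # persisted outside the instance

# Instance 1 handles two requests, then "fails"; a new instance continues.
CartService().add_item("session-42", "book")
CartService().add_item("session-42", "pen")
print(state_repository["session-42"])             # ['book', 'pen']
```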

Archiving Storage Device

Archived data can be stored on tape, disk, or cloud.

content authenticity

assures the genuineness of stored content

self-healing

automatically detects and repairs corrupted objects

retention enforcement

configurable retention settings ensure content is not erased prior to the expiration of its defined retention period

Image-based backup approach

Creates a copy (snapshot) of the entire virtual disk and the configuration data associated with a particular VM.
- The backup is saved as a single entity called a VM image, which enables quick restoration of a VM
- Supports recovery at the VM level and the file level
- No agent is required inside the VM to perform backup
- Backup processing is offloaded from the VMs to a proxy server

Image-level backup makes a copy of the virtual disk and configuration associated with a particular VM. The backup is saved as a single entity called a VM image. This type of backup is suitable for restoring an entire VM in the event of a hardware failure or human error, such as the accidental deletion of the VM. It is also possible to restore individual files and folders/directories within a virtual machine. In an image-level backup, the backup software can back up VMs without installing backup agents inside the VMs or at the hypervisor level. The backup processing is performed by a proxy server that acts as the backup client, thereby offloading the backup processing from the VMs. The proxy server communicates with the management server responsible for managing the virtualized compute environment. It sends commands to create a snapshot of the VM to be backed up and to mount the snapshot to the proxy server. A snapshot captures the configuration and virtual disk data of the target VM and provides a point-in-time view of the VM. The proxy server then performs the backup by using the snapshot.

source-based deduplication

Data is deduplicated at the source (backup client); the backup client sends only new, unique segments across the network.
- Reduces storage capacity and network bandwidth requirements
- Recommended for ROBO environments for taking centralized backups
- Cloud service providers can also implement this method when performing backup from the consumer's location to their location

Source-based data deduplication eliminates redundant data at the source (backup clients) before it is transmitted to the backup device. The deduplication system consists of multiple backup clients and a deduplication server. A deduplication agent is installed on the backup client to perform deduplication. The deduplication server maintains a hash index of the deduplicated data. The deduplication agent running on the clients checks each file for duplicate content: it creates the hash value for each chunk of the file and checks with the deduplication server whether the hash is already present on the server. If there is no match on the server, the client sends the hash and the corresponding chunk to the deduplication server to store the backup data. If the chunk has already been backed up, the chunk is not sent to the deduplication server, which ensures redundant backup data is eliminated at the client. The deduplication server can be deployed in different ways. The deduplication server software can be installed on a general-purpose server or on VMs that access the backup target available in the environment. Some vendors offer the deduplication server along with a backup device as an appliance. The deduplication server would support encryption for secure backup data transmission and would also support replication for disaster recovery purposes. Source-based deduplication reduces the amount of data transmitted over the network from the source to the backup device, thus requiring less network bandwidth. There is also a substantial reduction in the capacity required to store the backup data. Backing up only unique data from the backup client reduces the backup window. However, a deduplication agent running on the client may impact backup performance, especially when a large amount of data needs to be backed up. When image-level backup is implemented, the backup workload is moved to a proxy server; the deduplication agent is installed on the proxy server to perform deduplication without impacting the VMs running applications. Organizations can implement source-based deduplication when performing remote office/branch office (ROBO) backup to their centralized data center. Cloud service providers can also implement source-based deduplication when performing backup (Backup as a Service) from the consumer's location to their location.

data protection

ensures content stored on the CAS system is available even if a disk or a node fails

change rate

fewer the changes to the content between backups, the greater is the efficiency of deduplication

Deduplication Granularity level

File-level deduplication and sub-file level deduplication. The level at which data is identified as duplicate affects the amount of redundancy or commonality. The operational levels of deduplication include file-level deduplication and sub-file deduplication.

File-level deduplication (also called single-instance storage) detects and removes redundant copies of identical files in a backup environment. Only one copy of the file is stored; subsequent copies are replaced with a pointer to the original file. By removing all of the subsequent copies of a file, a significant amount of space savings can be achieved. File-level deduplication is simple but does not address the problem of duplicate content inside the files. A change in any part of a file also results in classifying it as a new file and saving it as a separate copy. For example, two 10-MB presentations with a difference in just the title page are not considered duplicate files, and each file is stored separately.

Sub-file deduplication breaks the file into smaller blocks and then uses a standard hash algorithm to detect redundant data within and across files. As a result, sub-file deduplication eliminates duplicate data across files. There are two forms of sub-file deduplication: fixed-length and variable-length. Fixed-length block deduplication divides the files into fixed-length blocks and uses a hash algorithm to find duplicate data. Although simple in design, the fixed-length block approach may miss opportunities to discover redundant data because the block boundaries of similar data may be different. For example, the addition of a person's name to a document's title page may shift the whole document and make all blocks appear to have changed, causing the deduplication method to fail to detect equivalencies. In variable-length block deduplication, if there is a change in one block, then the boundary for that block only is adjusted, leaving the remaining blocks unchanged. Consequently, more data is identified as common data, and there is less backup data to store, as only the unique data is backed up. Variable-length block deduplication yields greater granularity in identifying duplicate data, improving upon the limitations of file-level and fixed-length block deduplication. A toy comparison of the two methods is sketched below.
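In the sketch, after a one-byte insertion at the front of a document, fixed-length chunks no longer line up, while content-defined (variable-length) boundaries resynchronize, so most chunks still match. The 8-byte window, block sizes, and SHA-256 "rolling" hash are simplifications, not any product's actual chunking algorithm.

```python
import hashlib
import random

def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, mask: int = 0x3F, min_len: int = 16) -> list[bytes]:
    chunks, start = [], 0
    for i in range(8, len(data)):
        window = data[i - 8:i]                     # boundary decided by content
        if i - start >= min_len and hashlib.sha256(window).digest()[0] & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def fingerprints(chunks: list[bytes]) -> set[str]:
    return {hashlib.sha256(c).hexdigest() for c in chunks}

random.seed(0)
doc = bytes(random.randrange(256) for _ in range(4096))
edited = b"X" + doc                                # one byte added to the "title page"

shared_fixed = fingerprints(fixed_chunks(doc)) & fingerprints(fixed_chunks(edited))
shared_var = fingerprints(variable_chunks(doc)) & fingerprints(variable_chunks(edited))
print(len(shared_fixed), len(shared_var))          # near zero vs. most chunks shared
```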

Backup Granularity

- Full backup
- Incremental backup
- Cumulative backup
- Synthetic backup
- Incremental forever backup

Changed block tracking for backup

Identifies and tags any blocks that have changed since the last VM snapshot. Enables the backup application to back up only the blocks that have changed, rather than backing up every block.

audit trails

keeps track of management activities and any access or disposition of data

retention period

longer the data retention period, the greater is the chance of identical data existence in the backup

frequency of full backup

more frequently the full backups are conducted, the greater is the advantage of deduplication

location independence

The physical location of the stored data is irrelevant to the application that requests the data.

archiving server

Software on which the archiving policy is configured (policy engine).

cas data integrity

Provides assurance that the stored content has not been altered.

EMC Spanning

Provides backup and recovery services for SaaS applications. Helps organizations protect and manage their information in the cloud. Allows administrators and end users to search for, restore, and export data.

performance

provides faster access to the objects compared to tapes and optical discs

EMC Centera

Purpose-built archiving storage platform. Facilitates governance and compliance needs for retention and preservation. Compared to traditional archive storage, EMC Centera:
- provides single-instance storage and self-healing
- provides guaranteed content authenticity
- supports industry and regulatory standards

Fault detection and retry logic

Refers to a mechanism that implements logic in the application code to improve availability. The application detects a service that is temporarily down and retries it, which may result in a successful restoration of the service.
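A hedged sketch of the pattern: call a service, detect the failure, and retry a few times with a short backoff before giving up (the URL is hypothetical).

```python
import time
import urllib.error
import urllib.request

def call_with_retry(url: str, attempts: int = 3, delay_s: float = 1.0) -> bytes:
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()                 # service healthy or restored
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts:
                raise                              # fault persists; surface it
            time.sleep(delay_s * attempt)          # back off before retrying

# call_with_retry("http://inventory.example.internal/health")
```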

Archiving Agent

Software running on the application servers; responsible for scanning the data that can be archived.

deduplication method

the highest amount of deduplication across an organization is discovered using variable-length, sub-file deduplication

data type

the more unique the data, the less intrinsic duplication exists

Key requirements for data archiving solutions

- Provide automated, policy-driven archiving
- Provide scalability, authenticity, immutability, availability, and security
- Support single-instance storage and a variety of online storage options (disk and cloud-based storage)
- Provide rapid retrieval of archived data when required
- Be capable of handling a variety of electronic documents, including e-mail, instant messages, and files
- Provide features for indexing, searching, and reporting
- Support eDiscovery to enable legal investigations and litigation holds

Archiving solutions should meet an organization's compliance requirements through automated, policy-driven data retention and deletion. They should provide features such as scalability, authenticity, immutability, availability, and security. The archiving solution should be able to authenticate the creation and integrity of files in the archive storage. Long-term reliability is key for archiving solutions because failure of an archive system could have disastrous consequences: these systems hold critical documents, and any failure could have compliance, legal, and business consequences. In order to manage large volumes of data, an important technical requirement of an archiving solution is single-instance storage (a storage system that keeps one copy of content and eliminates duplicate data). The archiving solution should support a variety of online storage options such as disk-based storage and cloud-based storage. Another key factor is support for a variety of data types, including e-mails, databases, PDFs, images, audio, video, binary files, and HTML files. Powerful indexing and searching capability in archiving solutions speeds up data retrieval; an effective archival system needs to support complex searches of content within documents. Archiving solutions should enable electronic discovery (eDiscovery) and sharing of data for litigation purposes in a timely and compliant manner. Reporting capabilities are required to process huge volumes of data and deliver customized reports for compliance requirements.

MTBF calculation

MTBF = total uptime / number of failures

EMC InfoArchive

Unified archiving platform that stores structured data and unstructured content in a single, consolidated repository. Provides the ability to audit and preserve data and content to meet a variety of regulatory and governance mandates. Stores information in an open, industry-standard format for long-term retention and easy access.

single instance storage

uses a unique content address to guarantee the storage of only a single instance of an object

replicated backup service

When the service provider manages only the data replication and the IT infrastructure at the disaster recovery site, and the local backups are managed by the consumer organization.

