Lesson 13
Backup and Recovery
Backup and recovery processes ensure that accurate and reliable copies of data and system configurations are created, maintained, and tested.
Traditional
Online
Replication
Automation
Traditional (Backup and Recovery)
Traditional backup and recovery generally uses removable media or a local disk library.
Online (Backup and Recovery)
Online backup and recovery preserves data by creating copies of it and storing them in an online or cloud-based environment.
Replication
Replication is the process of creating and maintaining multiple copies of data across different locations.
Automation (Backup and Recovery)
Automation is an approach to recovery that relies on automated provisioning and replacement of systems.
Traditional Backup Strategies
Full Backup
Differential
Incremental
Full Backup
Backs up all files. Restore requires full backup media.
Differential
Backs up all files created or modified since last full backup. Does not reset archive bit. Restore requires full backup + most recent differential.
Incremental
Backs up all files created or modified since the last backup (full or incremental). Does reset archive bit. Restore requires full backup + all subsequent incrementals.
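The archive-bit behavior described above can be illustrated with a toy model. This is a minimal sketch, not how any particular backup product works: the `FileSystem` class and the three backup functions are hypothetical names, and files are represented only by their names and archive bits.

```python
from dataclasses import dataclass, field


@dataclass
class FileSystem:
    """Toy file system: maps file name -> archive bit (True = changed)."""
    archive_bit: dict = field(default_factory=dict)

    def write(self, name: str) -> None:
        # Creating or modifying a file sets its archive bit.
        self.archive_bit[name] = True


def full_backup(fs: FileSystem) -> set:
    """Copy every file, then clear every archive bit."""
    copied = set(fs.archive_bit)
    for name in fs.archive_bit:
        fs.archive_bit[name] = False
    return copied


def differential_backup(fs: FileSystem) -> set:
    """Copy files whose archive bit is set; leave the bits untouched."""
    return {name for name, changed in fs.archive_bit.items() if changed}


def incremental_backup(fs: FileSystem) -> set:
    """Copy files whose archive bit is set, then clear those bits."""
    copied = {name for name, changed in fs.archive_bit.items() if changed}
    for name in copied:
        fs.archive_bit[name] = False
    return copied
```

Because a differential leaves the bits set, each differential grows to include everything changed since the last full backup, which is why a restore needs only the latest one. Each incremental clears the bits, so it holds only the changes since the previous backup, and a restore needs every incremental in order.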
Disk-to-Disk Options
Network-attached Storage (NAS)
Storage Area Network (SAN)
Network-attached Storage (NAS)
A NAS is a dedicated file-level storage device.
- Connects over Ethernet
- Relatively inexpensive to add additional NAS devices
Storage Area Network (SAN)
A SAN is a specialized high-speed network (typically Fibre Channel) that provides network access to storage devices. A SAN can create an image by mirroring a production disk to another disk inside the storage array.
Online Backup Strategies
Cloud
Disk Shadowing
Electronic Vaulting
Remote Journaling
Cloud (Online Backup)
Scheduled backup to a cloud location or cloud backup service.
Disk Shadowing
In disk shadowing, data is written to (and read from) two or more independent disks. Process is transparent to the user.
Electronic Vaulting
Electronic vaulting copies files as they change and periodically transmits them to a secure backup location.
Remote Journaling
Remote journaling copies and periodically transmits logs to a backup location.
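One way to picture how journaled logs are used at recovery time: the backup location holds an older copy of the data plus the transmitted log, and the log is replayed against that copy. The sketch below assumes a hypothetical log format of `(operation, key, value)` tuples; real journaling systems define their own formats.

```python
def replay(snapshot: dict, journal: list) -> dict:
    """Rebuild current state by applying journal entries to an older snapshot.

    Each journal entry is a hypothetical (operation, key, value) tuple,
    where operation is "set" or "delete".
    """
    state = dict(snapshot)  # work on a copy so the snapshot stays intact
    for operation, key, value in journal:
        if operation == "set":
            state[key] = value
        elif operation == "delete":
            state.pop(key, None)
    return state
```

For example, replaying `[("set", "balance", 120), ("delete", "hold", None)]` against the snapshot `{"balance": 100, "hold": 20}` yields `{"balance": 120}`.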
Replication Strategies
Point-in-Time
Asynchronous
Synchronous
Point-in-Time (Replication)
Periodic snapshots replicated.
Asynchronous (Replication)
Asynchronous replication is an automated process that streams copies of data. Write is considered complete as soon as local storage commits. Remote storage updated with a slight time lag.
Synchronous (Replication)
Synchronous replication is an automated process that streams copies of data. Both the local and the remote write must complete successfully before the system can proceed, so no acknowledged write is lost if the primary fails.
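The difference between the two write paths can be sketched with an in-memory model. This is only an illustration under simplifying assumptions: `ReplicatedStore` is a hypothetical class, both "sites" are plain dictionaries, and the network is replaced by a queue.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary/replica pair with a synchronous or asynchronous write path."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = {}
        self.remote = {}
        self.pending = deque()  # writes not yet applied remotely (async only)

    def write(self, key, value) -> None:
        self.local[key] = value
        if self.synchronous:
            # Synchronous: the remote copy also commits before we return.
            self.remote[key] = value
        else:
            # Asynchronous: acknowledge now, ship the write later.
            self.pending.append((key, value))

    def drain(self) -> None:
        """Apply queued writes to the remote copy (the replication lag)."""
        while self.pending:
            key, value = self.pending.popleft()
            self.remote[key] = value

    def at_risk(self) -> set:
        """Keys whose latest local value has not reached the remote copy."""
        return {k for k, v in self.local.items() if self.remote.get(k) != v}
```

In asynchronous mode, `at_risk()` is non-empty between a `write` and the next `drain`, which is the window in which a primary failure loses data. In synchronous mode it is always empty, at the cost of every write waiting on the remote commit.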
Automation Strategies
Infrastructure-as-Code (IAC)
Immutable System
Infrastructure-as-Code (IAC)
IAC uses code (configuration files) to manage configurations and automate provisioning of infrastructure.
- Solves the problem of configuration drift.
Immutable System
Immutability is the principle that resources should not be changed, only created and destroyed (e.g., an image of a system preconfigured to a desired "known good" state). Uses automation to replace rather than fix.
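A small sketch of both ideas together: configuration drift is detected by comparing a declared desired state with what is actually running, and the immutable response is to rebuild from the declaration rather than patch in place. The `DESIRED` settings and function names are hypothetical and stand in for whatever an IaC tool would manage.

```python
# Hypothetical desired state, as it might be declared in a configuration file.
DESIRED = {"image": "web-v2", "port": 443, "tls": True}


def detect_drift(desired: dict, actual: dict) -> dict:
    """Return {setting: (desired, actual)} for every setting that differs."""
    keys = desired.keys() | actual.keys()
    return {
        k: (desired.get(k), actual.get(k))
        for k in keys
        if desired.get(k) != actual.get(k)
    }


def reconcile_immutably(desired: dict, actual: dict) -> dict:
    """If anything drifted, discard the instance and build a new one."""
    if detect_drift(desired, actual):
        return dict(desired)  # fresh instance from the known-good definition
    return actual
```

A running instance whose port was hand-edited to 8443 would be reported as `{"port": (443, 8443)}` and then replaced by a fresh copy of `DESIRED`, not repaired.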
Cost Balancing
Inverse relationship between the Cost of Disruption and the Cost to Recover.
Traditional Recovery
Tape backup
- Low complexity
- Low cost
- Recovery measured in hours to days
- More recoverable
Enhanced Recovery
Automated solutions
- Medium complexity
- Low cost
- Recovery measured in hours to days
- More recoverable
Rapid Recovery
Asynchronous replication
- High complexity
- Moderate cost
- Recovery measured in minutes to hours
Continuous Availability
- High complexity
- High cost
- Recovery measured in seconds
Resiliency
Resiliency is the capability to continue operating even when there has been a disruption or abnormal operating conditions.
Redundancy is the duplication of critical components or functions with the intention of increasing reliability and mitigating the risks associated with a single point of failure (SPOF).
- Fault tolerance is the capability of a system to continue to operate in the event of failure of one or more system components (redundancy).
- Categories of resiliency include system, storage, power, transmission, and site.
System Resiliency
Load Balancing
Clustering
High Availability
Fail-secure
Load Balancing
Load balancing involves distributing incoming network traffic across multiple independent systems to ensure that no single server becomes overwhelmed with requests.
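Round-robin is one common distribution policy and is easy to sketch. The example below is an illustration only: `RoundRobinBalancer` is a hypothetical class, servers are plain strings, and health is set by hand; real load balancers use active health checks and often weighted policies.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server) -> None:
        self.healthy.discard(server)

    def mark_up(self, server) -> None:
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # At least one healthy server exists, so this loop terminates.
        while True:
            server = next(self._ring)
            if server in self.healthy:
                return server
```

With servers `a`, `b`, and `c`, four requests go to `a`, `b`, `c`, `a`. After `mark_down("b")`, the rotation continues with `c`, then `a`, so the remaining systems absorb the traffic.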
Clustering
Clustering involves grouping multiple systems together to form a single logical unit or cluster.
High Availability
High Availability (HA) is an automatic failover capability that reduces or eliminates the need to manually activate redundant hardware.
- Asymmetric (active/passive)
- Symmetric (active/active)
Fail-secure
Principle that a failure will result in a secure or trustworthy state.
RAID
Redundant Array of Independent Disks (RAID) is a data storage virtualization technology.
- RAID combines multiple disk drive components into one or more logical units for the purposes of fault tolerance (data redundancy) and/or performance improvement.
- RAID can be configured for mirroring, striping, or both.
- Disk mirroring is the process of writing the same data to two partitions on separate disks.
- Disk striping is the process of dividing data into blocks and spreading the data blocks across multiple storage devices.
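Mirroring and striping can be sketched on byte strings. This is a simplified illustration, with "disks" modeled as Python lists and hypothetical function names; it omits parity, controllers, and everything else a real RAID implementation involves.

```python
def stripe(data: bytes, disks: int, block_size: int) -> list:
    """Split data into blocks and deal them round-robin across disks."""
    layout = [[] for _ in range(disks)]
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for index, block in enumerate(blocks):
        layout[index % disks].append(block)
    return layout


def unstripe(layout: list) -> bytes:
    """Reassemble striped data by reading blocks back in round-robin order."""
    out = []
    depth = max((len(disk) for disk in layout), default=0)
    for row in range(depth):
        for disk in layout:
            if row < len(disk):
                out.append(disk[row])
    return b"".join(out)


def mirror(data: bytes, copies: int = 2) -> list:
    """Write the same data to every disk."""
    return [bytes(data) for _ in range(copies)]
```

Striping `b"ABCDEFGH"` across two disks in 2-byte blocks puts `AB` and `EF` on the first disk and `CD` and `GH` on the second. Losing either disk loses the data, which is why striping alone improves performance but not fault tolerance, while a mirror survives the loss of one copy.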
Power Resiliency
Redundancy
UPS Battery Backup
Generator
Supplier Diversity
Redundancy (Power)
Component level: Having two or more power supplies and fans.
UPS Battery Backup
An uninterruptible power supply (UPS) provides backup power when a regular power source fails, or voltage drops to an unacceptable level. Battery is finite.
Generator
A generator is a standby, secondary, limited source of electrical power when the power grid is down or inaccessible. Fuel must be available.
Supplier Diversity
More than one supplier and/or access to multiple power grids.
Transmission Routing Resiliency
Alternate Routing
Diverse Routing
Last-mile Circuit Protection
Alternate Routing
Multiple paths for data to travel between two points. The network can automatically reroute traffic to an alternate path if the primary path becomes unavailable or congested.
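Automatic rerouting can be sketched as a path search that ignores failed links. The example assumes a hypothetical topology expressed as undirected `(node, node)` pairs and uses breadth-first search, which finds a fewest-hop path; production routing protocols use their own metrics and convergence mechanisms.

```python
from collections import deque


def find_path(links, source, target, failed=frozenset()):
    """Breadth-first search for a path that avoids failed links.

    Returns a list of nodes, or None if the target is unreachable.
    """
    graph = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # treat the link as unavailable in either direction
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

Given the links A-B, B-D, A-C, C-E, and E-D, traffic from A to D normally takes A, B, D. If the B-D link fails, the same call with `failed={("B", "D")}` returns the alternate path A, C, E, D.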
Diverse Routing
Data is transmitted over multiple geographically diverse paths or routes.
Last-mile Circuit Protection
Redundant last-mile circuits, such as multiple fiber optic or copper cables, to provide backup paths for data transmission in case of a failure or outage on the primary circuit.
Alternate Physical Sites
Cold Site
Warm Site
Hot Site
Mirrored
Cold Site
A cold site has basic HVAC infrastructure. No server-related or communications equipment.
Warm Site
A warm site has HVAC, servers, and communications infrastructure and equipment. Systems need to be configured (updated). Data needs to be restored.
Hot Site
A hot site has HVAC, servers, and communications infrastructure and equipment. Fully configured and ready to operate. Data has been replicated.
Mirrored
A mirrored site is an identical (or nearly identical) site that is operational in concert with the primary site on a load-balancing basis.
Alternate Third-Party Sites
Mobile Site
Reciprocal Site
DRaaS
Mobile Site
A mobile site is a transportable modular unit with pre-ordered hardware and software. The delivery site must provide access roads, water, waste disposal, power, and connectivity.
Reciprocal Site
A reciprocal site is based on an agreement to have access to/use of another organization's facilities.
DRaaS
Cloud-based Disaster-Recovery-as-a-Service offers full recovery in a cloud-based environment.
Continuity of Operations
In its simplest form, continuity of operations is the capability of a business to continue to operate in adverse (disaster) conditions.
- In a business context, disasters are disruptive events that significantly impact an organization's capability to operate.
- The impact could be to people, technology, facilities or any combination thereof.
Adverse Conditions
External
Infrastructure
Human
External (Adverse Conditions)
Large-scale geological or meteorological events such as earthquakes, storms, floods, hurricanes, tornadoes, and wildfires. Environmental events such as pollution and sea level rise. Public health events such as a pandemic.
Infrastructure (Adverse Conditions)
Loss of service such as electricity, HVAC, water. Technical issues such as equipment or communications failure.
Human (Adverse Conditions)
Workplace accidents, walkouts, strikes, civil disturbance, cyber attacks, cyber warfare, war, and terrorism.
Continuity of Operations Governance
Continuity of operations is a shared responsibility.
- The Board of Directors (or equivalent) is responsible for approval of continuity of operations policies and oversight of strategy development and testing.
- Management is responsible for the development of strategic and tactical plans and procedures, external relationships, training, testing, and audit.
- Business units are responsible for developing unit-specific procedures.
Continuity of Operations Planning
The objective of continuity of operations planning is to prepare for continued operation.
Disaster Recovery Plans (DRP)
Business Continuity Plans (BCP)
Disaster Recovery Plans (DRP)
DRPs focus on the recovery and restoration of technology, physical plant, and personnel.
Business Continuity Plans (BCP)
BCPs focus on the overall strategy for sustaining business activities during a disaster (or smaller interruption) and the subsequent recovery period.
Continuity of Operations Planning Workflow
1. Project Initiation
2. Business Impact Analysis
3. Plan Development
4. Procedure Development
5. Training
6. Testing
7. Auditing
8. Maintenance Review & Update
Plan Readiness
Continuity of operations plans (DRP, BCP) should be maintained in a state of readiness.
- Personnel trained to fulfill their roles and responsibilities within the plan.
- Plans and strategies exercised to validate their content.
- Systems and system components tested on a scheduled basis to ensure their recovery and operability.
- Plan examination and auditing to ensure compliance with business objectives.
Testing Objectives
The objective of testing should be to evaluate continuity of operations strategies, plans, and procedures; not institutional knowledge.
- The outcome of testing should be strategy, plan, and procedure modifications (if necessary), and an enhanced participant familiarity with all the facets of the plan.
Testing Approaches
Tabletop
Failover
Simulation
Tabletop (Testing)
Tabletop testing is a group workshop built around a hypothetical scenario. It focuses on applying plans and procedures and on identifying gaps in preparedness.
Failover (Testing)
Failover testing is performed to evaluate the ability of a system or application to recover from a failure and switch to a backup or secondary system or component seamlessly.
Simulation (Testing)
In a simulation, DR and/or BC plans are executed in a controlled environment (e.g., staging) to simulate a real-world disaster or outage. The simulation can be done at different levels of granularity.
Parallel Processing
Business continuity parallel processing is a complex and costly strategy to ensure uninterrupted business operations during unexpected events or disruptions.
- Parallel processing requires systems that can handle critical business functions simultaneously (in parallel) with the primary systems, thereby minimizing the impact of disruptions on overall operations.
Plan Audit
A plan audit provides management with an independent assessment of the effectiveness of plans, procedures, training, and testing, as well as strategic alignment assurance.
- The type and the extent of auditing performed depend on the risks involved, management's assurance requirements, and the availability of audit resources.