Disaster Recovery (DR) is a set of plans and preparations to recover from major disruptions. It is initiated when:
An incident escalates into a disaster.
A natural or man-made event immediately qualifies as a disaster.
Examples of disasters:
Ransomware that disables all systems.
Fire, flood, or earthquake damaging the primary facility.
The DR plan is triggered when:
The Incident Response (IR) plan is no longer sufficient.
The organization’s primary site or systems become inoperable.
Examples:
A malware outbreak shuts down 90% of the company’s systems.
A hurricane floods the data center and power infrastructure.
The Contingency Planning (CP) team creates the Disaster Recovery Planning Team (DRPT).
DRPT responsibilities:
Developing the DR policy.
Organizing Disaster Recovery Response Teams (DRRTs).
Coordinating all DR planning and execution.
Ensuring all teams are trained and assigned.
DR Management Team:
Oversees all DRRTs and coordinates efforts on-site.
Communications Team:
Handles internal and external communication.
Works with PR and Legal to maintain trust and compliance.
Computer Recovery (Hardware) Team:
Recovers usable physical assets.
Orders and configures replacement equipment.
Systems Recovery (OS) Team:
Restores operating systems (Windows, Linux, etc.).
May merge with the application or hardware teams.
Network Recovery Team:
Assesses and restores wiring, routers, and switches.
Restores Internet and internal connectivity.
Storage Recovery Team:
Restores on-site, off-site, or cloud-based storage.
Often works closely with systems and hardware teams.
Applications Recovery Team:
Recovers and validates critical software systems.
Data Management Team:
Handles data restoration from backups, journals, and replication.
Validates data integrity post-recovery.
Vendor Contact Team:
Coordinates with equipment suppliers, service providers, and utility companies.
Damage Assessment & Salvage Team:
Provides initial estimates of damage and salvageable assets.
Business Interface Team:
Helps non-IT departments resume function.
Logistics Team:
Manages space, supplies, facilities, food, and support services.
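The Data Management Team's post-recovery integrity validation can be illustrated with a checksum comparison. The following is a minimal sketch, assuming a manifest of SHA-256 hashes was captured at backup time; the manifest format and the function names `sha256_of` and `find_corrupted` are hypothetical and not part of any specific backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large restores do not exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(manifest: dict, restore_root: Path) -> list:
    """Return relative paths that are missing or whose hash differs.

    `manifest` maps a relative path to the SHA-256 hex digest recorded
    at backup time (an assumed format for this sketch).
    """
    bad = []
    for rel_path, expected in sorted(manifest.items()):
        restored = restore_root / rel_path
        if not restored.is_file() or sha256_of(restored) != expected:
            bad.append(rel_path)
    return bad
```

An empty result means every file listed in the manifest was restored with matching content; any returned path would need to be restored again from another backup, journal, or replica.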
The primary role of the DR plan is to prepare to restore operations either at the primary site post-disaster or at an alternate site if the original site is unusable.
Recovery efforts are guided by:
Business Impact Analysis (BIA).
Predefined team roles and recovery procedures.
Organize the DR Team:
Initial appointments are made by the Contingency Planning Management Team (CPMT).
Team roles are defined, including team lead and technical/logistical specialists.
Additional personnel are added as planning evolves.
Develop the DR Policy Statement:
A formal statement provides authority, strategic direction, scope, and purpose.
It aligns with organizational goals and legal/regulatory requirements.
Review the BIA:
Use insights from the Business Impact Analysis to identify critical systems and assets and prioritize recovery tasks.
Example: Email and payroll systems are prioritized over marketing archive servers.
Identify Preventive Controls:
Aim to reduce the effects of disasters before they happen.
Examples: surge protectors, backup power (UPS), off-site data replication, and fire suppression systems.
Create DR Strategies:
Define how operations will be restored, including Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).
Strategies may include hot/cold/warm sites, cloud-based failover systems, and remote workforce activation.
Develop the DR Plan Document:
Includes step-by-step procedures for each recovery team, site recovery checklists, and resource/vendor contacts.
The plan must be clear and easy to execute under pressure.
Test, Train, and Exercise the DR Plan:
Testing validates procedures (e.g., tabletop exercises, simulations).
Training prepares personnel for real emergencies.
Exercises uncover weaknesses in coordination or execution.
Maintain the Plan:
The DR plan is a living document and should be updated after system or personnel changes and following test outcomes or actual disasters.
Include a review schedule (e.g., annually).
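The BIA review and DR strategy steps above can be sketched in code. This is a rough illustration, assuming each system carries a BIA priority (1 = most critical), an RTO, an RPO, a backup interval, and an estimated restore time; the `SystemProfile` fields and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    name: str
    bia_priority: int               # 1 = most critical (assumed convention)
    rto_hours: float                # maximum tolerable downtime
    rpo_hours: float                # maximum tolerable data loss, in time
    backup_interval_hours: float    # how often backups are taken
    estimated_restore_hours: float  # how long a restore is expected to take


def recovery_order(systems):
    """Order systems by BIA priority, then by the tightest RTO."""
    return sorted(systems, key=lambda s: (s.bia_priority, s.rto_hours))


def objective_gaps(system):
    """List which objectives the current strategy cannot meet."""
    gaps = []
    # Worst-case data loss equals the time between backups.
    if system.backup_interval_hours > system.rpo_hours:
        gaps.append("RPO")
    if system.estimated_restore_hours > system.rto_hours:
        gaps.append("RTO")
    return gaps


email = SystemProfile("email", 1, 4, 1, 0.5, 3)
payroll = SystemProfile("payroll", 1, 8, 24, 24, 6)
archive = SystemProfile("marketing-archive", 3, 72, 24, 168, 48)

for system in recovery_order([archive, payroll, email]):
    print(system.name, objective_gaps(system) or "objectives met")
```

In this made-up data set, the weekly backup of the marketing archive cannot satisfy a 24-hour RPO, which is the kind of gap a strategy review is meant to surface.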
The DR team, led by the DR team leader, begins with the development of the DR policy soon after the team is formed. The DR policy contains the following key elements:
Purpose
Scope
Roles and responsibilities
Resource requirements
Training requirements
Exercise and testing schedules
Plan maintenance schedule
Special considerations
Covers unique organizational needs, such as:
Data storage and retention.
Sensitive data handling.
Regulatory and cross-border compliance (e.g., HIPAA, GDPR).
May include provisions for:
Vendor contract reviews.
Cloud vs. on-prem recovery needs.
Disaster classification is the process of:
Evaluating the nature, scope, and severity of an event.
Deciding if it qualifies as a disaster rather than just an incident.
Impacts:
Triggers DR plan activation.
Determines resource allocation and urgency.
The most common classification method is based on projected or actual damage.
Example scale:
Moderate – Limited system/data impact.
Severe – Widespread service degradation.
Critical – Total service outage or physical destruction.
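A damage-based scale like the one above could be encoded as a small classifier. The sketch below assumes damage is measured as the fraction of systems that are down; the 50% cut-off between Moderate and Severe is an illustrative assumption, not a value from these notes.

```python
from enum import Enum


class Severity(Enum):
    MODERATE = "Moderate"
    SEVERE = "Severe"
    CRITICAL = "Critical"


def classify(fraction_systems_down: float, site_destroyed: bool = False) -> Severity:
    """Map projected or actual damage to the example scale."""
    if not 0.0 <= fraction_systems_down <= 1.0:
        raise ValueError("fraction_systems_down must be between 0 and 1")
    # Total outage or physical destruction is always Critical.
    if site_destroyed or fraction_systems_down >= 1.0:
        return Severity.CRITICAL
    # Assumed threshold: half or more of the systems down is widespread.
    if fraction_systems_down >= 0.5:
        return Severity.SEVERE
    return Severity.MODERATE
```

An organization would replace the threshold with values agreed during planning, so that the classification decision is not improvised during the event.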
Natural Disasters:
Earthquakes, floods, wildfires, hurricanes, lightning strikes.
Man-made Disasters:
Malware outbreaks, DDoS attacks, insider threats, sabotage.
Some events, such as fires, may fall into both categories:
Natural: Lightning strike.
Man-made: Arson or faulty wiring.
Many disasters begin as incidents.
Escalation happens when:
The incident exceeds response capacity.
The scope and impact go beyond incident containment thresholds.
Example: A localized DDoS on one server becomes disaster-level when it disrupts all customer portals for hours.
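The escalation rule can be written as an explicit check. This is only a sketch: the `IncidentStatus` fields and the default limits in `EscalationThresholds` are hypothetical, and a real plan would define its own containment thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentStatus:
    contained: bool          # has the IR team contained the incident?
    services_affected: int   # number of services currently disrupted
    hours_disrupted: float   # how long the disruption has lasted


@dataclass(frozen=True)
class EscalationThresholds:
    max_services: int = 1    # hypothetical containment threshold
    max_hours: float = 2.0   # hypothetical containment threshold


def should_escalate(status: IncidentStatus,
                    limits: EscalationThresholds = EscalationThresholds()) -> bool:
    """Return True when an uncontained incident exceeds the thresholds."""
    if status.contained:
        return False
    return (status.services_affected > limits.max_services
            or status.hours_disrupted > limits.max_hours)
```

With these assumed limits, a DDoS against one server for half an hour stays an incident, while the same attack disrupting five portals for three hours would be escalated.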
Slow-onset disasters:
Develop gradually over time.
Examples (natural and man-made):
Drought, deforestation, environmental degradation.
Malware infections spreading unnoticed.
Loss of vendor services or chronic infrastructure decay.
Require monitoring and forecasting for early action.
Example: A persistent service provider issue slowly degrading operations.
Rapid-onset disasters:
Strike suddenly with little or no warning.
Often cause immediate and severe impact.
Natural causes:
Earthquakes, tornadoes, floods, storm winds, mudflows.
Man-made causes:
Cyberterrorism, DDoS attacks, hacktivism, war.
Example: A zero-day exploit triggers ransomware across all endpoints within minutes.
Start with your most valuable resource: people.
Ask: Do we have enough cross-trained employees to restore operations?
Ensure:
Cross-training for critical roles.
Readiness to lead recovery, not just follow instructions.
Everyone on the DR team must know:
Their exact role
Who they report to
What systems or people they’re responsible for
Example roles:
Emergency coordination (fire, medical, police)
Evacuation teams
System shutdown teams
Facility safety monitors
Alert systems must reach:
Internal stakeholders
External responders and agencies (e.g., Red Cross, insurance, local authorities)
Example: A DR team coordinator activates the alert roster after an earthquake, and key personnel and medical services are notified immediately.
Human life and safety take top priority in any disaster.
Only after all personnel are accounted for:
Begin protecting systems, data, and infrastructure
Clear prioritization in the plan helps avoid confusion in high-stress moments.
Document the event from beginning to end:
Timeline of actions
Decisions taken
Observations and evidence
Use for:
Legal defense
Insurance claims
Post-disaster analysis
Plan improvement
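One way to keep such a record is a timestamped log that separates actions, decisions, and observations. The sketch below is illustrative only; the `DisasterLog` class and its entry kinds are assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

KINDS = ("action", "decision", "observation")


@dataclass
class LogEntry:
    timestamp: datetime
    kind: str
    author: str
    note: str


@dataclass
class DisasterLog:
    entries: list = field(default_factory=list)

    def record(self, kind, author, note, timestamp=None):
        """Append an entry; defaults to the current UTC time."""
        if kind not in KINDS:
            raise ValueError(f"kind must be one of {KINDS}")
        entry = LogEntry(timestamp or datetime.now(timezone.utc), kind, author, note)
        self.entries.append(entry)
        return entry

    def timeline(self):
        """Return entries as text lines in chronological order."""
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        return [f"{e.timestamp.isoformat()} [{e.kind}] {e.author}: {e.note}"
                for e in ordered]
```

Sorting by timestamp at read time means entries written late, for example after an evacuation, still appear in the right place in the timeline.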
DR team members should be trained to:
Secure systems and data before evacuation
Evacuate physical assets, if possible
Isolate compromised systems
Ensure clean shutdowns to prevent data loss
Prepare alternative systems in case primaries fail:
Spare equipment
Failover services (cloud/DR agency)
Dynamic systems (e.g., DHCP for auto IP assignment)
Promote:
Network auto-reconfiguration
System fault tolerance
Interoperability to avoid a "disaster after the disaster"
Card 1: Personal Emergency Card
Includes:
Next of kin contact
Medical conditions/allergies
Personal ID info
Card 2: Organizational DR Snapshot
Includes:
DR coordinator contact
Evacuation procedures and shelter locations
Emergency services numbers
Hotline or status-check contact for employees
The Contingency Planning Management Team (CPMT) must:
Anticipate unexpected obstacles
Empower DR teams to make real-time decisions
Examples of flexibility:
Swapping alternate locations
Using backup methods for communication
Reassigning responsibilities on the fly
If primary physical facilities survive:
Begin immediate restoration of systems, networks, data, and apps.
The DR team shifts from containment to reconstruction mode.
Restore:
Power and connectivity
Critical infrastructure
User access systems
If the primary site is damaged or destroyed:
Execute alternate facility strategies: warm/cold/hot site activation, cloud failover, or partner-hosted continuity centers.
Prioritize:
Relocating operations
Regaining access to critical data and systems
Business Continuity (BC) ensures an organization can continue critical operations when its primary site is unavailable. It focuses on the relocation of business functions, not just technology.
BC is executed alongside the Disaster Recovery (DR) plan.
BC is managed by the CEO/COO or senior business executives (not just IT).
BC is needed if the disaster is long-term and severely impacts the primary site.
BC is also needed if business revenue and operations depend on continuity (e.g., retail, manufacturing, financial services).
BC is not always needed for small companies or organizations that can pause operations temporarily.
Form the BC Team:
The CPMT appoints a team leader and key departmental representatives.
Roles and responsibilities are defined, documented, and communicated.
Develop BC Policy:
Outlines the executive vision, scope of continuity operations, and authority to act during disruption.
Acts as the foundation for all planning actions.
Review the BIA:
Reuses data from the Business Impact Analysis to identify critical business functions and prioritize systems to relocate.
Aligns continuity planning with business goals and dependencies.
Identify Preventive Controls:
Not unique to BC; largely overlaps with contingency planning and disaster recovery planning, including backups, hot/warm/cold sites, and insurance coverage.
Create Relocation Strategies:
Identify alternate locations, service providers, and leased continuity spaces.
Develop plans for staff relocation, equipment provisioning, and communication rerouting.
Example: Activate an alternate customer support center in a neighboring city.
Develop the BC Plan Document:
Includes step-by-step relocation procedures, contact lists, facility maps, and transportation plans.
Tailored to the site, department, and business function.
Test and Train:
Conduct scenario-based drills, tabletop exercises, and live relocation tests.
Educate employees, vendors, and executives.
Testing and training identify gaps and improve readiness.
Maintain the Plan:
Update the BC plan after system changes, personnel turnover, and each test or disaster.
Include a review schedule (e.g., every 6–12 months).
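The maintenance rule above combines a fixed schedule with change-driven updates, which can be expressed as a short check. This is a sketch under the assumption that a 180-day interval approximates the six-month end of the suggested range; the function names are hypothetical.

```python
from datetime import date, timedelta


def next_review(last_review: date, interval_days: int = 180) -> date:
    """Date of the next scheduled review (180 days is an assumed default)."""
    return last_review + timedelta(days=interval_days)


def review_due(last_review: date, today: date,
               interval_days: int = 180,
               changed_since_review: bool = False) -> bool:
    """A review is due on schedule, or at once after a relevant change.

    `changed_since_review` stands for any trigger named in the plan:
    a system change, personnel turnover, a test, or an actual disaster.
    """
    return changed_since_review or today >= next_review(last_review, interval_days)
```

For example, `review_due(date(2024, 1, 1), date(2024, 3, 1), changed_since_review=True)` is true even though the scheduled date has not arrived.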
The BC policy contains the following key elements:
Purpose
Scope
Roles and responsibilities
Resource requirements
Training requirements
Exercise and testing schedules
Plan maintenance schedule
Special considerations
Covers overlaps with Disaster Recovery (DR) and information storage and retrieval.
Should identify where detailed plans are stored, responsible personnel, and any unique requirements for extreme events.
Continuity strategies define where and how an organization will continue operations during a disaster.
Strategies include exclusive-use and shared-use options.
Increasingly, cloud-based solutions are being used as modern alternatives.
Exclusive-use options give the organization dedicated access to a backup facility, ensuring availability and control during a disaster.
Three main types: Hot Site, Warm Site, and Cold Site.
Hot Site:
A fully functional backup facility that includes servers, data, applications, and even workstations.
Requires only the latest data backups and on-site personnel.
Recovery time: Minutes to hours.
Cost: Very high.
Use case: Organizations needing 24/7 uptime, e.g., financial trading firms.
Example: A bank maintains a hot site in another city to resume real-time operations during a cyberattack.
Warm Site:
A backup facility with infrastructure and equipment, but no configured software.
Requires time to load systems, apps, and data.
Recovery time: Hours to days.
Cost: Medium.
Use case: Mid-size firms with moderate downtime tolerance.
Example: A regional insurance company uses a warm site for periodic testing and partial redundancy.
Cold Site:
A facility with basic utilities only, without hardware, software, or communications systems.
Requires bringing or installing all systems post-disaster.
Recovery time: Days to weeks.
Cost: Low.
Use case: Organizations with high cost sensitivity.
Example: A small logistics firm reserves empty space in a warehouse as a cold site.
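The trade-off between recovery time and cost across hot, warm, and cold sites can be sketched as a lookup that picks the cheapest option compatible with a given RTO. The hour boundaries below are illustrative assumptions chosen to match the rough ranges above (hours, days, weeks); they are not figures from these notes.

```python
# Cheapest first. Each entry: (site type, smallest RTO in hours for which
# the site type is assumed adequate, relative cost from the notes).
SITE_OPTIONS = [
    ("cold", 336.0, "low"),       # assumed: suits RTOs of two weeks or more
    ("warm", 72.0, "medium"),     # assumed: suits RTOs of three days or more
    ("hot", 0.0, "very high"),    # fallback for any tighter RTO
]


def cheapest_site_for(rto_hours: float) -> str:
    """Return the cheapest site type whose assumed recovery time fits the RTO."""
    if rto_hours < 0:
        raise ValueError("rto_hours must not be negative")
    for name, min_rto_hours, _cost in SITE_OPTIONS:
        if rto_hours >= min_rto_hours:
            return name
    raise AssertionError("unreachable: the hot site accepts any RTO")
```

Under these assumptions a trading firm with a 2-hour RTO lands on a hot site, while a firm that can tolerate three weeks of downtime can accept a cold site.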
Shared-use options provide shared access to recovery facilities.
Lower cost, but availability is not guaranteed.
Includes Timeshare, Service Bureau, and Mutual Agreement.
Timeshare:
A facility co-leased with a business partner or sister organization.
Cost-efficient.
Risk: Conflicts if both need the facility at once.
Requires complex coordination and mutual trust.
Use case: Shared office parks, universities, or franchise models.
Example: Two universities agree to host each other’s data centers if one is down.
Service Bureau:
A contract with a third party provides facilities and off-site backups, with access guaranteed by the agreement.
Use case: Organizations needing guaranteed but outsourced continuity.
Disadvantage: Can be expensive and restrictive.
Example: A SaaS company contracts with Sungard AS to guarantee DR site access.
Mutual Agreement:
An agreement between two organizations to assist each other during disasters.
Very cost-effective but may strain relationships.
Risk: Limited capacity to support both organizations during a large-scale disaster.
Example: Two regional hospitals agree to share medical equipment and IT support in emergencies.
Rolling Mobile Sites:
Portable, truck-mounted facilities used for quick setup in disaster zones.
Prepositioned Resources:
Older servers, workstations, and networking gear kept off-site.
Cloud-Based Recovery:
Virtual infrastructure is scalable and location-independent, which makes cloud provisioning a popular alternative to physical sites.
Example: A retail company uses AWS Disaster Recovery to spin up systems in minutes.
Incident: Ransomware attack on a single system/user
Disaster: Ransomware attack on all organizational systems/users
Attack occurs: Depending on scope, may be classified as an incident or a disaster
Organizational disaster occurs.
The DR plan works to reestablish operations at either the primary site (or new permanent site) or an alternate site.
Staff implements DR/BC plans; the BC plan relocates the organization to the alternate site.
The timeline showcases the progression from incident detection to business continuity.
Illustrates the relationship between Incident Response (IR), Disaster Recovery (DR), Business Continuity (BC), and Crisis Management (CM).
The timeline includes:
Incident Response: Incident detection leads to IR plan activation and incident reaction. If the IR plan can’t contain the incident, it escalates to a disaster.
Disaster Recovery: Begins with disaster reaction and the activation of the DR plan. If the DR plan can’t restore operations quickly, it triggers BC.
Business Continuity: BC plan is activated, leading to BC operations at an alternate site. DR completion triggers the end of BC.
Crisis Management: Threat of injury or loss of life triggers the CM plan. The CM plan leads to CM operations, and all personnel being safe and accounted for triggers the end of CM.
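The progression in the timeline can be modeled as a small state machine. The sketch below is one possible reading of the triggers listed above; the phase and event names are hypothetical labels. CM is handled as a separate check because it runs in parallel with the other plans.

```python
from enum import Enum


class Phase(Enum):
    NORMAL = "normal operations"
    IR = "incident response"
    DR = "disaster recovery"
    DR_AND_BC = "disaster recovery with business continuity"


# (current phase, event) -> next phase, following the timeline triggers.
TRANSITIONS = {
    (Phase.NORMAL, "incident_detected"): Phase.IR,
    (Phase.IR, "incident_contained"): Phase.NORMAL,
    (Phase.IR, "containment_failed"): Phase.DR,
    (Phase.DR, "restoration_delayed"): Phase.DR_AND_BC,
    (Phase.DR, "dr_complete"): Phase.NORMAL,
    (Phase.DR_AND_BC, "dr_complete"): Phase.NORMAL,  # DR completion ends BC
}


def run(events, state=Phase.NORMAL):
    """Apply events in order and return the resulting phase."""
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not valid during {state.value}")
        state = TRANSITIONS[key]
    return state


def cm_active(threat_to_life: bool, all_personnel_accounted_for: bool) -> bool:
    """CM starts on a threat to life and ends once everyone is accounted for."""
    return threat_to_life and not all_personnel_accounted_for
```

For instance, `run(["incident_detected", "containment_failed", "restoration_delayed"])` ends in the combined DR and BC phase, and a following `"dr_complete"` event returns the organization to normal operations.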
Contingency Planning Notes