Chapter 10: Contingency Planning - Part 2
Introduction to Disaster Recovery (DR)
- Disaster Recovery (DR) is a set of plans and preparations to recover from major disruptions. It is initiated when:
- An incident escalates into a disaster.
- A natural or man-made event immediately qualifies as a disaster.
- Examples of disasters:
- Ransomware that disables all systems.
- Fire, flood, or earthquake damaging the primary facility.
When Is the DR Plan Activated?
- Triggered when:
- The Incident Response (IR) plan is no longer sufficient.
- The organization’s primary site or systems become inoperable.
- Examples:
- A malware outbreak shuts down 90% of the company’s systems.
- A hurricane floods the data center and power infrastructure.
Disaster Recovery Planning Team (DRPT)
- The Contingency Planning (CP) team creates the DRPT.
- DRPT responsibilities:
- Developing the DR policy.
- Organizing Disaster Recovery Response Teams (DRRTs).
- Coordinating all DR planning and execution.
- Ensuring all teams are trained and assigned.
Common DRRTs and Their Roles
- DR Management Team:
- Oversees all DRRTs and coordinates efforts on-site.
- Communications Team:
- Handles internal and external communication.
- Works with PR and Legal to maintain trust and compliance.
- Computer Recovery (Hardware) Team:
- Recovers usable physical assets.
- Orders and configures replacement equipment.
- Systems Recovery (OS) Team:
- Restores operating systems (Windows, Linux, etc.).
- May merge with the application or hardware teams.
- Network Recovery Team:
- Assesses and restores wiring, routers, and switches.
- Restores Internet and internal connectivity.
- Storage Recovery Team:
- Restores on-site, off-site, or cloud-based storage.
- Often works closely with systems and hardware teams.
- Applications Recovery Team:
- Recovers and validates critical software systems.
- Data Management Team:
- Handles data restoration from backups, journals, and replication.
- Validates data integrity post-recovery.
- Vendor Contact Team:
- Coordinates with equipment suppliers, service providers, and utility companies.
- Damage Assessment & Salvage Team:
- Provides initial estimates of damage and salvageable assets.
- Business Interface Team:
- Helps non-IT departments resume function.
- Logistics Team:
- Manages space, supplies, facilities, food, and support services.
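One way the Data Management Team might validate data integrity after recovery is to compare a cryptographic digest recorded at backup time with a digest computed over the restored copy. The sketch below assumes SHA-256 digests were stored alongside the backups; the function names and sample data are illustrative, not part of any particular backup product.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def restored_copy_is_intact(restored: bytes, recorded_digest: str) -> bool:
    """Compare the digest of restored data with the digest recorded at backup time."""
    return sha256_of(restored) == recorded_digest


# Hypothetical data: the digest is recorded when the backup is taken.
original = b"payroll ledger 2024-Q1"
digest_at_backup = sha256_of(original)

print(restored_copy_is_intact(original, digest_at_backup))           # True
print(restored_copy_is_intact(b"payroll ledger", digest_at_backup))  # False
```

A mismatch only shows that the restored copy differs from what was backed up; it does not say which copy is correct, so the team would still need to investigate.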
DR Plan's Primary Role
- The primary role of the DR plan is to prepare to restore operations either at the primary site post-disaster or at an alternate site if the original site is unusable.
- Recovery efforts are guided by:
- Business Impact Analysis (BIA).
- Predefined team roles and recovery procedures.
The 8-Step Disaster Recovery Process (Adapted from NIST CP)
- Organize the DR Team:
- Initial appointments are made by the Contingency Planning Management Team (CPMT).
- Team roles are defined, including team lead and technical/logistical specialists.
- Additional personnel are added as planning evolves.
- Develop the DR Policy Statement:
- A formal statement provides authority, strategic direction, scope, and purpose.
- It aligns with organizational goals and legal/regulatory requirements.
- Review the BIA:
- Use insights from the Business Impact Analysis to identify critical systems and assets and prioritize recovery tasks.
- Example: Email and payroll systems are prioritized over marketing archive servers.
- Identify Preventive Controls:
- Aim to reduce the effects of disasters before they happen.
- Examples: surge protectors, backup power (UPS), off-site data replication, and fire suppression systems.
- Create DR Strategies:
- Define how operations will be restored, including Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).
- Strategies may include hot/cold/warm sites, cloud-based failover systems, and remote workforce activation.
- Develop the DR Plan Document:
- Includes step-by-step procedures for each recovery team, site recovery checklists, and resource/vendor contacts.
- The plan must be clear and easy to execute under pressure.
- Test, Train, and Exercise the DR Plan:
- Testing validates procedures (e.g., tabletop exercises, simulations).
- Training prepares personnel for real emergencies.
- Exercises uncover weaknesses in coordination or execution.
- Maintain the Plan:
- The DR plan is a living document and should be updated after system or personnel changes and following test outcomes or actual disasters.
- Include a review schedule (e.g., annually).
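To make the RTO and RPO terms from the "Create DR Strategies" step concrete: the RTO bounds how long a system may stay down, and the RPO bounds how much data, measured as time since the last good backup, may be lost. The sketch below uses the BIA example above (email and payroll ahead of the marketing archive). The specific time values are assumptions chosen for illustration, not recommended targets.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class RecoveryObjective:
    system: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, as time since last good backup


def recovery_order(objectives: List[RecoveryObjective]) -> List[str]:
    """Order systems so that the tightest RTO is recovered first."""
    return [o.system for o in sorted(objectives, key=lambda o: o.rto)]


def objectives_met(obj: RecoveryObjective, downtime: timedelta, backup_age: timedelta) -> bool:
    """Check an actual (or exercised) recovery against both objectives."""
    return downtime <= obj.rto and backup_age <= obj.rpo


# Hypothetical values; real ones come from the organization's BIA.
objectives = [
    RecoveryObjective("marketing archive", timedelta(days=7), timedelta(days=1)),
    RecoveryObjective("email", timedelta(hours=4), timedelta(minutes=15)),
    RecoveryObjective("payroll", timedelta(hours=8), timedelta(hours=1)),
]

print(recovery_order(objectives))  # ['email', 'payroll', 'marketing archive']

email = objectives[1]
print(objectives_met(email, timedelta(hours=3), timedelta(minutes=10)))  # True
print(objectives_met(email, timedelta(hours=3), timedelta(hours=2)))     # False: RPO missed
```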
Disaster Recovery Policy
The DR team, led by the DR team leader, begins developing the DR policy soon after the team is formed. The DR policy contains the following key elements:
- Purpose
- Scope
- Roles and responsibilities
- Resource requirements
- Training requirements
- Exercise and testing schedules
- Plan maintenance schedule
- Special considerations
Special Considerations
- Covers unique organizational needs, such as:
- Data storage and retention.
- Sensitive data handling.
- Regulatory and cross-border compliance (e.g., GDPR, HIPAA).
- May include provisions for:
- Vendor contract reviews.
- Cloud vs. on-prem recovery needs.
What Is Disaster Classification?
- Disaster classification is the process of:
- Evaluating the nature, scope, and severity of an event.
- Deciding if it qualifies as a disaster rather than just an incident.
- Impacts:
- Triggers DR plan activation.
- Determines resource allocation and urgency.
Classification by Damage Severity
- The most common classification method is based on projected or actual damage.
- Example scale:
- Moderate – Limited system/data impact.
- Severe – Widespread service degradation.
- Critical – Total service outage or physical destruction.
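The example scale above could be encoded as a small classifier. This is only a sketch: the input (fraction of services unavailable) and the 50% cutoff for "widespread" are assumptions, since the scale itself is qualitative and each organization would set its own thresholds.

```python
from enum import Enum


class Severity(Enum):
    MODERATE = 1  # limited system/data impact
    SEVERE = 2    # widespread service degradation
    CRITICAL = 3  # total service outage or physical destruction


def classify(fraction_services_down: float, facility_destroyed: bool = False) -> Severity:
    """Map projected or actual damage to the three-level example scale."""
    if not 0.0 <= fraction_services_down <= 1.0:
        raise ValueError("fraction_services_down must be between 0 and 1")
    if facility_destroyed or fraction_services_down >= 1.0:
        return Severity.CRITICAL
    if fraction_services_down >= 0.5:  # assumed cutoff for "widespread"
        return Severity.SEVERE
    return Severity.MODERATE


print(classify(0.1).name)                           # MODERATE
print(classify(0.6).name)                           # SEVERE
print(classify(0.2, facility_destroyed=True).name)  # CRITICAL
```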
Classification by Origin
- Natural Disasters:
- Earthquakes, floods, wildfires, hurricanes, lightning strikes.
- Man-made Disasters:
- Malware outbreaks, DDoS attacks, insider threats, sabotage.
- Some events, like fires, may fall into either category, depending on the cause:
- Natural: Lightning strike.
- Man-made: Arson or faulty wiring.
Incident Escalation to Disaster
- Many disasters begin as incidents.
- Escalation happens when:
- The incident exceeds response capacity.
- The scope and impact go beyond incident containment thresholds.
- Example: A localized DDoS on one server becomes disaster-level when it disrupts all customer portals for hours.
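The two escalation conditions above can be read as a simple rule: escalate when either one is true. The sketch below assumes the organization measures response capacity as a number of systems and the containment threshold as hours of outage; both metrics and all values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncidentStatus:
    affected_systems: int
    response_capacity: int          # systems the IR team can handle at once (assumed metric)
    outage_hours: float
    containment_limit_hours: float  # assumed threshold taken from the IR plan


def is_disaster(status: IncidentStatus) -> bool:
    """Escalate when the incident exceeds capacity or the containment threshold."""
    exceeds_capacity = status.affected_systems > status.response_capacity
    exceeds_threshold = status.outage_hours > status.containment_limit_hours
    return exceeds_capacity or exceeds_threshold


# A DDoS attack confined to one server, still within the IR plan's limits.
localized = IncidentStatus(affected_systems=1, response_capacity=10,
                           outage_hours=0.5, containment_limit_hours=2)
# The same attack after it has taken down all customer portals for hours.
widespread = IncidentStatus(affected_systems=25, response_capacity=10,
                            outage_hours=5, containment_limit_hours=2)

print(is_disaster(localized))   # False
print(is_disaster(widespread))  # True
```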
Slow-Onset Disasters
- Develop gradually over time.
- Examples (natural and man-made):
- Drought, deforestation, environmental degradation.
- Malware infections spreading unnoticed.
- Loss of vendor services or chronic infrastructure decay.
- Require monitoring and forecasting for early action.
- Example: A persistent service provider issue slowly degrading operations.
Rapid-Onset Disasters
- Strike suddenly with little or no warning.
- Often cause immediate and severe impact.
- Natural causes:
- Earthquakes, tornadoes, floods, storm winds, mudflows.
- Man-made causes:
- Cyberterrorism, DDoS attacks, hacktivism, war.
- Example: A zero-day exploit triggers ransomware across all endpoints within minutes.
Planning to Recover
People First – The #1 Asset
- Start with your most valuable resource: people.
- Ask: Do we have enough cross-trained employees to restore operations?
- Ensure:
- Cross-training for critical roles.
- Readiness to lead recovery, not just follow instructions.
Delegation of Roles and Responsibilities
- Everyone on the DR team must know:
- Their exact role
- Who they report to
- What systems or people they’re responsible for
- Example roles:
- Emergency coordination (fire, medical, police)
- Evacuation teams
- System shutdown teams
- Facility safety monitors
Alert Rosters and Notifications
- Alert systems must reach:
- Internal stakeholders
- External responders and agencies (e.g., Red Cross, insurance, local authorities)
- Example: A DR team coordinator activates the alert roster after an earthquake, and key personnel and medical services are notified immediately.
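One common way to organize an alert roster is a call tree, in which each person notified is responsible for notifying a few others. The sketch below walks such a tree breadth-first to produce the notification order. The roster contents and role names are hypothetical, and a real roster would also carry phone numbers and fallback contacts.

```python
from collections import deque
from typing import Dict, List


def notification_order(call_tree: Dict[str, List[str]], coordinator: str) -> List[str]:
    """Return contacts in the order they are reached, level by level."""
    order: List[str] = []
    seen = {coordinator}
    queue = deque([coordinator])
    while queue:
        caller = queue.popleft()
        for contact in call_tree.get(caller, []):
            if contact not in seen:  # skip anyone already notified
                seen.add(contact)
                order.append(contact)
                queue.append(contact)
    return order


# Hypothetical roster: the coordinator calls team leads and medical services,
# and each lead then calls their own team members.
call_tree = {
    "dr_coordinator": ["network_lead", "systems_lead", "medical_services"],
    "network_lead": ["network_tech"],
    "systems_lead": ["systems_admin"],
}

print(notification_order(call_tree, "dr_coordinator"))
# ['network_lead', 'systems_lead', 'medical_services', 'network_tech', 'systems_admin']
```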
Prioritization – People Before Tech
- Human life and safety take top priority in any disaster.
- Only after all personnel are accounted for:
- Begin protecting systems, data, and infrastructure
- Clear prioritization in the plan helps avoid confusion in high-stress moments.
Disaster Documentation Procedures
- Document the event from beginning to end:
- Timeline of actions
- Decisions taken
- Observations and evidence
- Use for:
- Legal defense
- Insurance claims
- Post-disaster analysis
- Plan improvement
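As an illustration of documenting an event from beginning to end, the sketch below keeps a timestamped log of actions, decisions, and observations and returns it as a timeline. The class and field names are assumptions; many organizations would use paper forms or an incident-management tool instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

ENTRY_KINDS = {"action", "decision", "observation"}


@dataclass
class LogEntry:
    timestamp: datetime
    kind: str
    author: str
    note: str


@dataclass
class DisasterLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, kind: str, author: str, note: str,
               when: Optional[datetime] = None) -> LogEntry:
        """Append an entry; the timestamp defaults to the current UTC time."""
        if kind not in ENTRY_KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        entry = LogEntry(when or datetime.now(timezone.utc), kind, author, note)
        self.entries.append(entry)
        return entry

    def timeline(self) -> List[LogEntry]:
        """Return entries in chronological order, e.g. for post-disaster analysis."""
        return sorted(self.entries, key=lambda e: e.timestamp)


log = DisasterLog()
log.record("decision", "dr_lead", "Activate alternate site",
           when=datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc))
log.record("observation", "facilities", "Water in server room",
           when=datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc))

for entry in log.timeline():
    print(entry.timestamp.isoformat(), entry.kind, entry.note)
```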
Mitigation Steps in the DR Plan
- DR team members should be trained to:
- Secure systems and data before evacuation
- Evacuate physical assets, if possible
- Isolate compromised systems
- Ensure clean shutdowns to prevent data loss
Redundancy and Alternative Systems
- Prepare alternative systems in case primaries fail:
- Spare equipment
- Failover services (cloud/DR agency)
- Dynamic systems (e.g., DHCP for auto IP assignment)
- Promote:
- Network auto-reconfiguration
- System fault tolerance
- Interoperability to avoid "disaster after the disaster"
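The idea of falling back to an alternative system when the primary fails can be sketched as trying each option in preference order and using the first one that passes a health check. The service names and the health-check function here are hypothetical; in practice the check might be a network probe or a monitoring query.

```python
from typing import Callable, Optional, Sequence


def first_available(services: Sequence[str],
                    is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy service in preference order, or None if all fail."""
    for service in services:
        if is_healthy(service):
            return service
    return None


# Hypothetical status table standing in for real health checks.
status = {"primary_dc": False, "spare_rack": False, "cloud_failover": True}
preference = ["primary_dc", "spare_rack", "cloud_failover"]

print(first_available(preference, lambda name: status[name]))  # cloud_failover
```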
Emergency Information Cards
- Card 1: Personal Emergency Card
- Includes:
- Next of kin contact
- Medical conditions/allergies
- Personal ID info
- Card 2: Organizational DR Snapshot
- Includes:
- DR coordinator contact
- Evacuation procedures and shelter locations
- Emergency services numbers
- Hotline or status-check contact for employees
Built-In Flexibility Is Critical
- The Contingency Planning Management Team (CPMT) must:
- Anticipate unexpected obstacles
- Empower DR teams to make real-time decisions
- Examples of flexibility:
- Swapping alternate locations
- Using backup methods for communication
- Reassigning responsibilities on the fly
If the Facility Is Intact
- If primary physical facilities survive:
- Begin immediate restoration of systems, networks, data, and apps.
- The DR team shifts from containment to reconstruction mode.
- Restore:
- Power and connectivity
- Critical infrastructure
- User access systems
If the Facility Is Destroyed
- If the primary site is damaged or destroyed:
- Execute alternate facility strategies: warm/cold/hot site activation, cloud failover, or partner-hosted continuity centers.
- Prioritize:
- Relocating operations
- Regaining access to critical data and systems
What Is Business Continuity (BC)?
- Business Continuity (BC) ensures an organization can continue critical operations when its primary site is unavailable. It focuses on the relocation of business functions, not just technology.
- BC is executed alongside the Disaster Recovery (DR) plan.
- Managed by the CEO/COO or senior business executives (not just IT).
When to Use Business Continuity
- If the disaster is long-term and severely impacts the primary site.
- If business revenue and operations depend on continuity (e.g., retail, manufacturing, financial services).
- Not always needed for small companies or organizations that can pause operations temporarily.
NIST-Based 8-Step BC Planning Process
- Form the BC Team:
- The CPMT appoints a team leader and key departmental representatives.
- Roles and responsibilities are defined, documented, and communicated.
- Develop BC Policy:
- Outlines the executive vision, scope of continuity operations, and authority to act during disruption.
- Acts as the foundation for all planning actions.
- Review the BIA:
- Reuses data from the Business Impact Analysis to identify critical business functions and prioritize systems to relocate.
- Aligns continuity planning with business goals and dependencies.
- Identify Preventive Controls:
- Not unique to BC; largely overlaps with contingency planning and disaster recovery planning, including backups, hot/warm/cold sites, and insurance coverage.
- Create Relocation Strategies:
- Identify alternate locations, service providers, and leased continuity spaces.
- Develop plans for staff relocation, equipment provisioning, and communication rerouting.
- Example: Activate an alternate customer support center in a neighboring city.
- Develop the BC Plan Document:
- Includes step-by-step relocation procedures, contact lists, facility maps, and transportation plans.
- Tailored to the site, department, and business function.
- Test and Train:
- Conduct scenario-based drills, tabletop exercises, and live relocation tests.
- Educate employees, vendors, and executives.
- Identify gaps and improve readiness.
- Maintain the Plan:
- Update the BC plan after system changes, after personnel turnover, and after each test or disaster.
- Include a review schedule (e.g., every 6–12 months).
Business Continuity Policy
- Purpose
- Scope
- Roles and responsibilities
- Resource requirements
- Training requirements
- Exercise and testing schedules
- Plan maintenance schedule
- Special considerations
BC Policy - Special Considerations
- Covers overlaps with Disaster Recovery (DR) and information storage and retrieval.
- Should identify where detailed plans are stored, responsible personnel, and any unique requirements for extreme events.
What Are Continuity Strategies?
- Continuity strategies define where and how an organization will continue operations during a disaster.
- Strategies include exclusive-use and shared-use options.
- Increasingly, cloud-based solutions are being used as modern alternatives.
Exclusive-Use Strategies Overview
- The organization has dedicated access to a backup facility, ensuring availability and control during a disaster.
- Three main types: Hot Site, Warm Site, and Cold Site.
Hot Site
- A fully functional backup facility that includes servers, data, applications, and even workstations.
- Requires only the latest data backups and on-site personnel.
- Recovery time: Minutes to hours.
- Cost: Very high.
- Use case: Organizations needing 24/7 uptime, e.g., financial trading firms.
- Example: A bank maintains a hot site in another city to resume real-time operations during a cyberattack.
Warm Site
- A backup facility with infrastructure and equipment, but no configured software.
- Requires time to load systems, apps, and data.
- Recovery time: Hours to days.
- Cost: Medium.
- Use case: Mid-size firms with moderate downtime tolerance.
- Example: A regional insurance company uses a warm site for periodic testing and partial redundancy.
Cold Site
- A facility with basic utilities only, without hardware, software, or communications systems.
- Requires bringing or installing all systems post-disaster.
- Recovery time: Days to weeks.
- Cost: Low.
- Use case: Organizations with high cost sensitivity.
- Example: A small logistics firm reserves empty space in a warehouse as a cold site.
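The trade-off among the three exclusive-use sites can be sketched as picking the cheapest option whose recovery time still fits the required RTO. The worst-case figures below are assumptions that stand in for "minutes to hours", "hours to days", and "days to weeks"; actual recovery times depend on the site and the contract.

```python
from datetime import timedelta
from typing import Optional

# Ordered cheapest first; worst-case recovery times are illustrative assumptions.
SITE_OPTIONS = [
    ("cold", timedelta(weeks=3)),
    ("warm", timedelta(days=3)),
    ("hot", timedelta(hours=4)),
]


def cheapest_site_for(rto: timedelta) -> Optional[str]:
    """Return the lowest-cost site type that can recover within the RTO."""
    for name, worst_case_recovery in SITE_OPTIONS:
        if worst_case_recovery <= rto:
            return name
    return None  # no site type in this table is fast enough for the stated RTO


print(cheapest_site_for(timedelta(days=30)))     # cold
print(cheapest_site_for(timedelta(days=5)))      # warm
print(cheapest_site_for(timedelta(hours=6)))     # hot
print(cheapest_site_for(timedelta(minutes=30)))  # None
```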
Shared-Use Strategies Overview
- Shared access to recovery facilities.
- Lower cost, but availability is not guaranteed.
- Includes Timeshare, Service Bureau, and Mutual Agreement.
Timeshare
- Co-leased with a business partner or sister organization.
- Cost-efficient.
- Risk: Conflicts if both need the facility at once.
- Requires complex coordination and mutual trust.
- Use case: Shared office parks, universities, or franchise models.
- Example: Two universities co-lease a recovery site that either can use if its own data center goes down.
Service Bureau
- Contract with a third party to provide facilities and off-site backups, with access guaranteed by contract.
- Use case: Organizations needing guaranteed but outsourced continuity.
- Disadvantage: Can be expensive and restrictive.
- Example: A SaaS company contracts with Sungard AS to guarantee DR site access.
Mutual Agreement
- An agreement between two organizations to assist each other during disasters.
- Very cost-effective but may strain relationships.
- Risk: Limited capacity to support both organizations during a large-scale disaster.
- Example: Two regional hospitals agree to share medical equipment and IT support in emergencies.
Specialized & Modern Continuity Options
- Rolling Mobile Sites:
- Portable, truck-mounted facilities used for quick setup in disaster zones.
- Prepositioned Resources:
- Older servers, workstations, and networking gear kept off-site.
- Cloud-Based Recovery:
- Virtual infrastructure is provisioned on demand, is scalable and location-independent, and has become a popular alternative to physical sites.
- Example: A retail company uses AWS Elastic Disaster Recovery to spin up systems in minutes.
- Incident: Ransomware attack on a single system/user
- Disaster: Ransomware attack on all organizational systems/users
- Attack occurs: Depending on scope, may be classified as an incident or a disaster
- Organizational disaster occurs.
- The DR plan works to reestablish operations at either the primary site (or new permanent site) or an alternate site.
- Staff implements DR/BC plans; the BC plan relocates the organization to the alternate site.
- The timeline showcases the progression from incident detection to business continuity.
- Illustrates the relationship between Incident Response (IR), Disaster Recovery (DR), Business Continuity (BC), and Crisis Management (CM).
- The timeline includes:
- Incident Response: Incident detection leads to IR plan activation and incident reaction. If the IR plan can’t contain the incident, it escalates to a disaster.
- Disaster Recovery: Begins with disaster reaction and the activation of the DR plan. If the DR plan can’t restore operations quickly, it triggers BC.
- Business Continuity: BC plan is activated, leading to BC operations at an alternate site. DR completion triggers the end of BC.
- Crisis Management: Threat of injury or loss of life triggers the CM plan. The CM plan leads to CM operations, and all personnel being safe and accounted for triggers the end of CM.
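The triggers in the timeline above can be summarized as a short decision sketch: IR always starts on detection, DR starts if IR cannot contain the incident, BC starts if DR cannot restore operations quickly, and CM runs independently whenever there is a threat of injury or loss of life. The function and parameter names are illustrative, and the end conditions (DR completion, all personnel accounted for) are left out for brevity.

```python
from typing import List


def plans_triggered(ir_contains_incident: bool,
                    dr_restores_quickly: bool,
                    threat_to_life: bool) -> List[str]:
    """Return the plans activated after an incident is detected, in order."""
    plans = ["IR"]  # incident detection always activates the IR plan
    if not ir_contains_incident:
        plans.append("DR")  # the incident has escalated to a disaster
        if not dr_restores_quickly:
            plans.append("BC")  # relocate operations to an alternate site
    if threat_to_life:
        plans.append("CM")  # crisis management runs alongside the other plans
    return plans


print(plans_triggered(True, True, False))    # ['IR']
print(plans_triggered(False, True, False))   # ['IR', 'DR']
print(plans_triggered(False, False, True))   # ['IR', 'DR', 'BC', 'CM']
```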