Watching That Platform Disaster Recovery Plan

Executive Summary

The Platform Disaster Recovery Plan details guidelines for declaring a disaster affecting the Watching That Platform. It covers roles and responsibilities, specifically outlining which team is required to act and how.

The Platform Disaster Recovery Plan describes the step-by-step process for recovering from the loss of Watching That’s Platform. It includes guidelines and a priority of work to ensure that applications and systems are recovered in a timely fashion.

Introduction

Watching That’s Platform Disaster Recovery Plan encompasses the applications, application environment, network and data communications infrastructure that are involved in the platform.

Any event that has a negative impact on a company’s business continuity or finances can be termed a disaster.

This includes hardware or software failure, a network outage, a power outage, physical damage to a building such as fire or flooding, human error or some other significant event. To mitigate the risk of a disaster, whether natural, man-made or an act of God, the company has developed a detailed Platform Disaster Recovery Plan.

The plan includes strategies and efforts that the company’s technical and management personnel will need to perform before, during and after a disruption occurs.

Recovery Time Objective and Recovery Point Objective

The Watching That Platform Disaster Recovery Plan uses two common industry terms for disaster planning:

Recovery Time Objective (RTO) – The time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the Disaster Recovery (DR) process should restore the relevant business process or service to the acceptable service level by 8:00 PM.

Recovery Point Objective (RPO) – The acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span only one hour, between 11:00 AM and 12:00 PM (noon).
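
The two worked examples above reduce to simple date arithmetic. The following is a minimal sketch in Python; the function names rto_deadline and rpo_recovery_point are illustrative only and do not refer to any existing Watching That tooling, and the calendar date is arbitrary.

```python
from datetime import datetime, timedelta


def rto_deadline(disaster_time: datetime, rto: timedelta) -> datetime:
    """Latest time by which the service should be back at its service level."""
    return disaster_time + rto


def rpo_recovery_point(disaster_time: datetime, rpo: timedelta) -> datetime:
    """Point in time before which all data should be recoverable."""
    return disaster_time - rpo


# Values from the examples above: disaster at noon, RTO 8 hours, RPO 1 hour.
disaster = datetime(2024, 1, 15, 12, 0)
print(rto_deadline(disaster, timedelta(hours=8)))        # 2024-01-15 20:00:00
print(rpo_recovery_point(disaster, timedelta(hours=1)))  # 2024-01-15 11:00:00
```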

A company typically decides on an acceptable RTO and RPO based on the financial impact to the business when systems are unavailable. The company determines financial impact by considering many factors, such as the loss of business and damage to its reputation due to downtime and the lack of system availability.

IT organisations then plan solutions to provide cost-effective system recovery based on the RPO within the timeline and the service level established by the RTO.

Scope

Watching That’s Platform Disaster Recovery Plan details step-by-step procedures to be taken if there is a disruption in Watching That’s Platform that renders the Platform environment inaccessible for an extended period of time. The Platform Disaster Recovery Plan also establishes a priority of work for recovering from the disaster.

Background

1.1 Database Audit and Data Collection

The Platform Disaster Recovery Plan team verifies that database data is backed up on a frequent basis. The backup and restore of all databases are regularly tested.

1.2 Network Audit

All Watching That Platform applications are hosted in Amazon Web Services (AWS) data centres.

1.3 Final Review

The Watching That Platform Disaster Recovery Plan is reviewed and tested on an annual basis to keep the plan in sync with current business and Watching That’s Platform environment needs.

Guidelines for Declaring a Disaster

A disaster will be declared if Watching That’s Platform is inaccessible for four consecutive hours or if Watching That management believes Watching That’s Platform will be unavailable for a twenty-four (24) hour period. Declaring a Disaster Recovery Event is a serious step, and a conservative approach will be taken when a decision is required.

If Watching That’s Platform production facility is destroyed, a disaster will be declared immediately.

Declaring a disaster is the responsibility of the Development Operations (DevOps) team and the CTO.
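
The declaration criteria above can be expressed as a simple check. This is a sketch only: the function name, parameter names and the treatment of the thresholds as inclusive are assumptions, and the result is intended to inform the DevOps team and the CTO, not to replace their decision.

```python
from datetime import timedelta
from typing import Optional

# Thresholds taken from the guidelines above.
CONSECUTIVE_OUTAGE_THRESHOLD = timedelta(hours=4)
EXPECTED_OUTAGE_THRESHOLD = timedelta(hours=24)


def should_declare_disaster(
    outage_so_far: timedelta,
    expected_total_outage: Optional[timedelta] = None,
    production_destroyed: bool = False,
) -> bool:
    """Return True if any declaration criterion in the guidelines is met."""
    # A destroyed production facility triggers an immediate declaration.
    if production_destroyed:
        return True
    # The Platform has already been inaccessible for four consecutive hours.
    if outage_so_far >= CONSECUTIVE_OUTAGE_THRESHOLD:
        return True
    # Management expects the Platform to be unavailable for 24 hours.
    if (
        expected_total_outage is not None
        and expected_total_outage >= EXPECTED_OUTAGE_THRESHOLD
    ):
        return True
    return False


print(should_declare_disaster(timedelta(hours=5)))  # True
print(should_declare_disaster(timedelta(hours=1)))  # False
```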

Roles and Responsibilities

1.1 Recovery Team and Executive

The CTO is the responsible Disaster Recovery Executive. In the CTO’s absence, the CEO is authorised to make decisions on behalf of Watching That. The DevOps team member will advise the Disaster Recovery Executive about the potential Disaster Recovery situation. The Disaster Recovery Executive will determine whether an incident should be classified as a Disaster Recovery Event, which will put the Platform Disaster Recovery Plan into effect. Furthermore, the Disaster Recovery Executive will notify the Director of Customer Success and the COO that a Disaster Recovery Event has been declared. The Director of Customer Success will determine the information that will be communicated to customers.

1.2 Manager of DevOps

The Manager of DevOps has overall responsibility to ensure that the Platform Disaster Recovery Plan and Disaster Recovery environment are properly maintained and tested. This person is also the leader of the Disaster Recovery Team and will lead the Disaster Recovery Team in implementing the Platform Disaster Recovery Plan.

1.3 DevOps Team

If a potential Disaster Recovery Event occurs after hours, the On-Call DevOps team member is responsible for identifying a possible incident and following the escalation process.

1.4 Communicating During a Disaster

The Communications Team is responsible for ensuring that all stakeholders are updated every 4 hours.

1.4.1 Communicating with Employees

The Communications Team is responsible for ensuring that the entire company has been notified of the disaster. The best and/or most practical means of contacting all employees will be used, with preference given to the following methods (in order):

  • Slack
  • SMS / Text / WhatsApp Message
  • Corporate Email
  • Telephone
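
The ordered preference above amounts to a fallback chain: try the first channel and move to the next only if it is unavailable. The sketch below assumes hypothetical sender callables supplied by the caller, each returning True on successful delivery; it does not use any real Slack, SMS or email API.

```python
from typing import Callable, Dict, Optional

# Channel names in the order of preference listed above.
CHANNEL_ORDER = [
    "Slack",
    "SMS / Text / WhatsApp Message",
    "Corporate Email",
    "Telephone",
]

# Hypothetical sender signature: takes the message, returns True if delivered.
Sender = Callable[[str], bool]


def notify_employees(message: str, senders: Dict[str, Sender]) -> Optional[str]:
    """Try each channel in order; return the first one that delivers."""
    for channel in CHANNEL_ORDER:
        send = senders.get(channel)
        if send is None:
            continue  # no sender configured for this channel
        try:
            if send(message):
                return channel
        except Exception:
            # Treat any failure as "channel unavailable" and fall through.
            continue
    return None


# Stub senders for illustration: Slack is down, SMS works.
stubs = {
    "Slack": lambda msg: False,
    "SMS / Text / WhatsApp Message": lambda msg: True,
}
print(notify_employees("Disaster declared", stubs))  # SMS / Text / WhatsApp Message
```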

1.4.2 Communicating with Customers

The Director of Customer Success is responsible for notifying customers of the declaration of a Disaster Recovery Event. The Director of Customer Success will inform customers of the nature of the disaster and estimated time to recovery. There will be regular communications (every 4 – 6 hours) to customers regarding the status and progress for the duration of the Disaster Recovery Event.

Disaster Recovery Process Prerequisites

The disaster recovery process consists of defining rules, processes and disciplines to ensure that the critical business processes will continue to function if there is a failure of one or more of the information processing or telecommunications resources upon which their operations depend. The following are key elements to a disaster recovery plan:

  • Create a list of key assets for each system
  • Perform risk assessment and audits of the assets
  • Establish priorities for applications and networks
  • Develop recovery strategies
  • Prepare inventory and documentation of the plan
  • Develop verification criteria and procedures
  • Implement the plan

Key people from each business unit should be members of the team and included in all disaster recovery planning activities. The disaster recovery planning group needs to understand the business processes, technology, networks and systems in order to create a Disaster Recovery Plan (DRP). A risk and business impact analysis should be prepared by the DRP group that includes at least the top ten potential disasters. After analysing the potential risks, priority levels should be assigned to each business process and application/system.

It is important to keep the inventory up to date and to maintain a complete list of equipment (physical and virtual), third-party services on which our products depend, locations and points of contact.

The goal is to provide viable, effective and economic recovery across all technology domains.

Each product group should create the list of assets and can use the following chart to classify them:

Classification and description:

1 Mission Critical
  • Fundamental to accomplishing the mission of the company.
  • Can be performed only by computers.
  • No alternative manual processing capability exists.
  • Must be restored within 36 hours.

2 Critical
  • Central in accomplishing the work of the company.
  • Primarily performed by computers.
  • Can be performed manually for a limited time period.
  • Must be restored starting at 36 hours and within 5 days.

3 Essential
  • Must have to complete the work of the company.
  • Performed by computers.
  • Can be performed manually for an extended time period.
  • Can be restored as early as 5 days, however it can take longer.

4 Non-Critical
  • Preferred for accomplishing the mission of the company.
  • Can be delayed until damaged service is restored and / or a new computer system implemented.
  • Can be performed manually.
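
As an illustration of how a product group might record its asset list against the four classes above, the sketch below models the chart as an enumeration and sorts assets into recovery order. The asset names are hypothetical, and encoding the restore windows as a single latest deadline per class is a simplifying assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum
from typing import Dict, List, Optional


class Classification(IntEnum):
    MISSION_CRITICAL = 1
    CRITICAL = 2
    ESSENTIAL = 3
    NON_CRITICAL = 4


# Latest restore time per class, read from the chart above.
# None means the chart gives no fixed upper bound.
RESTORE_DEADLINE: Dict[Classification, Optional[timedelta]] = {
    Classification.MISSION_CRITICAL: timedelta(hours=36),
    Classification.CRITICAL: timedelta(days=5),
    Classification.ESSENTIAL: None,
    Classification.NON_CRITICAL: None,
}


@dataclass
class Asset:
    name: str  # hypothetical asset names are used in the example below
    classification: Classification


def recovery_order(assets: List[Asset]) -> List[Asset]:
    """Sort assets so that the most critical class is recovered first."""
    return sorted(assets, key=lambda asset: asset.classification)


inventory = [
    Asset("internal-wiki", Classification.NON_CRITICAL),
    Asset("data-collection-api", Classification.MISSION_CRITICAL),
    Asset("reporting-dashboard", Classification.CRITICAL),
]
for asset in recovery_order(inventory):
    print(asset.classification.value, asset.name)
```

Because the sort is stable, assets within the same class keep the order in which the product group listed them.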

Disaster Recovery Site

If the multi-zone site in Amazon (AWS) becomes unavailable, Amazon will automatically migrate the Watching That Platform, through coded applications, to another location and/or geographical region.

Block-level backups are kept in real time for 24 hours, with full nightly backups kept for 90 days.
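
The retention windows above determine which kind of backup can serve a given restore point. The following sketch assumes the windows are inclusive and that a restore older than 24 hours falls back to the nearest nightly backup; the function name and return labels are illustrative.

```python
from datetime import timedelta
from typing import Optional

# Retention windows stated above.
BLOCK_LEVEL_RETENTION = timedelta(hours=24)
NIGHTLY_RETENTION = timedelta(days=90)


def backup_source(restore_point_age: timedelta) -> Optional[str]:
    """Return which backup type covers a restore point of the given age."""
    if restore_point_age < timedelta(0):
        raise ValueError("restore point cannot be in the future")
    if restore_point_age <= BLOCK_LEVEL_RETENTION:
        return "block-level"
    if restore_point_age <= NIGHTLY_RETENTION:
        return "nightly"
    return None  # older than any retained backup


print(backup_source(timedelta(hours=3)))   # block-level
print(backup_source(timedelta(days=10)))   # nightly
print(backup_source(timedelta(days=120)))  # None
```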

Disaster Recovery Process

1.1 Notification Phase

This phase includes the activities to notify the Disaster Recovery Executive of a possible disaster, directing the Disaster Recovery Team to assess the damages to the Watching That Platform and beginning the Disaster Recovery process if necessary.

1.1.1 Disaster Recovery Procedure

  1. The DevOps Team member will determine, if possible, the nature and impact of the incident:
    1. Open the AWS Service Health Dashboard to identify the status of the key AWS services used by the Watching That Platform production environment:
      1. AWS ELB
      2. AWS Route 53
      3. AWS EC2
      4. AWS S3
      5. Etc
  2. The DevOps Team member will notify key personnel of the incident. At a minimum, the following personnel must be notified:
    1. CTO
    2. CEO
    3. COO
    4. Director of Customer Success
  3. The DevOps Team member will include the following notification information if applicable:
    1. Nature of the emergency
    2. Loss of life or injuries
    3. Known damage
    4. Key system status
  4. The Disaster Recovery Executive will determine how to proceed. The following actions may be taken:
    1. Require the Disaster Recovery Team to conduct a further damage assessment. Information on the following items will be reported back to the Disaster Recovery Executive hourly:
      1. Cause of the emergency or disruption
      2. Potential for additional disruptions
      3. Physical infrastructure status
      4. Items impacted / requiring replacement
      5. Estimated time to restore to normal services
    2. Determine the extent of the incident and whether it is considered a Disaster Recovery Event as defined by the company Disaster Recovery Guidelines. If so, the Disaster Recovery Executive will instruct the Disaster Recovery Team to activate the Platform Disaster Recovery Plan.
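
The minimum recipients and notification fields in steps 2 and 3 of the procedure above could be captured in a small structure so that no item is forgotten under pressure. This is a sketch under stated assumptions: the class name, field names and plain-text rendering are illustrative, and fields with no information are omitted to reflect "if applicable".

```python
from dataclasses import dataclass
from typing import Optional

# Minimum recipients from step 2 of the procedure.
REQUIRED_RECIPIENTS = ("CTO", "CEO", "COO", "Director of Customer Success")


@dataclass
class IncidentNotification:
    nature_of_emergency: str
    loss_of_life_or_injuries: Optional[str] = None
    known_damage: Optional[str] = None
    key_system_status: Optional[str] = None

    def render(self) -> str:
        """Build the message body, skipping fields that do not apply."""
        fields = [
            ("Nature of the emergency", self.nature_of_emergency),
            ("Loss of life or injuries", self.loss_of_life_or_injuries),
            ("Known damage", self.known_damage),
            ("Key system status", self.key_system_status),
        ]
        return "\n".join(f"{label}: {value}" for label, value in fields if value)


# Example content is invented purely for illustration.
notice = IncidentNotification(
    nature_of_emergency="Regional network outage",
    key_system_status="AWS EC2 degraded",
)
print("To: " + ", ".join(REQUIRED_RECIPIENTS))
print(notice.render())
```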

1.2 Activation Phase

This phase includes activities that initiate the Disaster Recovery Event. Team members are notified, assembled and updated on the present situation.

1.2.1 Procedure

  1. The DevOps Team member will contact the CTO. If not reachable, the DevOps Team member will contact the CEO.
  2. The Disaster Recovery Executive will contact the Director of Customer Success to brief them on the present situation.
  3. The Director of Customer Success will follow departmental procedures, contact the impacted customers and provide them with information on estimated time to recovery.
  4. The Disaster Recovery Team will make an assessment of the damage at the Production Site, estimated time to recovery and priority of work.
  5. If required, DevOps will make contact with our service providers/vendors to assist and help assess the situation.
  6. The Disaster Recovery Executive will make a determination to declare the situation a Disaster Recovery Event.
  7. The Director of Customer Success will notify all customers about the Disaster Recovery Event.
  8. The Disaster Recovery Team begins the cutover to the Disaster Recovery site if applicable.

1.3 Recovery Phase

The recovery phase covers the steps taken to restore the Watching That Platform at the Disaster Recovery Site, if applicable.

The initial focus in bringing up the Disaster Recovery environment is to ensure that the data is as current as possible and to determine how much data was lost between the most recent Production-to-Disaster Recovery data synchronisation and the time Production was lost.

It is important, throughout the entire Disaster Recovery process, to have the engineering leads available. Work to get the Watching That Platform live will be performed with utmost priority and urgency.

1.3.1 Disaster Recovery Step by Step Procedure

  1. The Director of Customer Success will contact customers and inform them that Watching That is in the process of moving the Watching That Platform to the Disaster Recovery Site (if applicable).
  2. The DevOps Team member will ensure the standard outage pages are in place.
  3. The DevOps Team member will bring the databases and applications online in the Disaster Recovery environment.
  4. The DevOps Team member of the Disaster Recovery Team will reroute all data collections from the Watching That Platform to Disaster Recovery.
  5. The DevOps Team member will verify that all servers are secure and up to date.
  6. The Disaster Recovery Team will work with QA to conduct functionality tests to ensure the Disaster Recovery Site is operational.
  7. Watching That Platform production is cut over to the Disaster Recovery Site.
  8. The Disaster Recovery Executive will contact the Director of Customer Success and inform them that the Production Site has been successfully failed over to the Disaster Recovery Site.
  9. The Director of Customer Success will contact customers and inform them that Watching That has successfully moved Production to the Disaster Recovery Site.
  10. The Disaster Recovery Executive declares the disaster over once the Watching That Platform Production site (original or alternative) is running at 100% operational capacity.
  11. The Watching That Risk Management Group / Disaster Recovery Executive will work with Watching That’s insurance vendor to recoup the funds to assist with paying for replacement equipment or services (if necessary).

Recovery Time Objective (RTO)

Watching That’s AWS Hosted Production Environment RTO should be no more than 30 minutes within the scope of the AWS Region.

Recovery Point Objective (RPO)

Watching That’s AWS Hosted Database RPO should be no more than 4 hours.
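
After a real event or an annual test, the achieved recovery figures can be compared against the two targets stated above. The sketch below is illustrative: the function name and the three timestamps are assumptions, with data loss measured from the last data synchronisation to the moment Production was lost, and downtime measured from that moment to the restoration of service.

```python
from datetime import datetime, timedelta

# Targets stated in this plan.
RTO_TARGET = timedelta(minutes=30)
RPO_TARGET = timedelta(hours=4)


def meets_objectives(
    last_sync: datetime,
    production_lost: datetime,
    service_restored: datetime,
) -> dict:
    """Compare achieved downtime and data loss against the RTO and RPO."""
    downtime = service_restored - production_lost
    data_loss = production_lost - last_sync
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= RTO_TARGET,
        "rpo_met": data_loss <= RPO_TARGET,
    }


# Illustrative timeline: synced 09:30, lost 12:00, restored 12:25.
result = meets_objectives(
    last_sync=datetime(2024, 1, 15, 9, 30),
    production_lost=datetime(2024, 1, 15, 12, 0),
    service_restored=datetime(2024, 1, 15, 12, 25),
)
print(result["rto_met"], result["rpo_met"])  # True True
```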

Disaster Recovery Annual Test Plan

The Disaster Recovery test recurs annually in the first quarter of the year and incorporates a full mock fail-over up to, but not including, the point of redirecting customers.

Platform Disaster Recovery Plan Maintenance Process

The Platform Disaster Recovery Plan and Disaster Recovery environment will be modified in response to changes in the Watching That Platform environment. Such changes might include personnel changes, critical application changes and network, hardware or software changes. Watching That’s Platform Disaster Recovery Plan is tested annually to ensure that Watching That has the appropriate environment to support Disaster Recovery at 100% capacity.

The Disaster Recovery test is designed to ensure that the Disaster Recovery data is in sync with Production data and that the Disaster Recovery applications function the same as the Production applications.