...
Disaster Recovery Plan
...
CloudCard
...
In the event of a disaster, companies must act quickly and decisively. The goal of this document is to establish a trusted plan in preparation for any future disasters. This plan will act as a guide for CloudCard to follow. CloudCard uses Amazon Web Services (AWS) to store all of its data. This strategic choice allows CloudCard to quickly launch resources in AWS to ensure business continuity. This plan highlights the AWS services and features that CloudCard plans to leverage if disaster strikes, minimizing the impact on data, systems, and overall business operations. Though AWS supports multiple strategies, CloudCard has chosen a “pilot light” strategy in order to mitigate the risk of data loss in the event of a disaster.
...
PLAN OBJECTIVES
PLAN SCOPE
SERVICE RPO AND RTO TARGETS
BACKUP STRATEGY
AWS PILOT LIGHT PLAN & STRATEGY
PLAN REVIEW
REVISION HISTORY
ROLES AND RESPONSIBILITIES
EXTERNAL CONTACTS
INCIDENT RESPONSE
DR PROCEDURES
APPENDICES
This document details the policies and procedures of Cloud Card LLC in the event of a disruption to critical IT services or damage to IT equipment or data. These processes will ensure that those assets are recoverable to the right level and within the right timeframe to deliver a return to normal operations, with minimal impact on the business.
...
To quickly respond in the event of a natural disaster
To effectively respond in the event of a natural disaster
To mitigate/prevent data loss in the event of a natural disaster
...
Amazon Web Services
AWS data centers
AWS data storage
Amazon S3
Amazon Route 53
Amazon Machine Images
Amazon Elastic Beanstalk
Amazon Elastic Load Balancing
DNS records
Shentel
Cricket Wireless
Primary business operations
Primary business headquarters
Assignment of roles for disaster response personnel
Plans and procedures
Critical tasks checklist
...
The term pilot light refers to a DR scenario in which a minimal version of an environment is always running in the cloud. With AWS, CloudCard maintains a pilot light by configuring and running the most critical, core elements of its system. When the time comes for recovery, we will rapidly provision a full-scale production environment around the critical core.
Infrastructure elements for the pilot light itself include our Amazon RDS database servers, which are replicated to a different availability zone using a multi-AZ deployment as well as our Amazon S3 files, which are stored redundantly across multiple availability zones (data centers) to preserve data. This is the critical core of the system (the pilot light) around which all other infrastructure pieces in AWS can quickly be provisioned to restore the complete system.
To provision the remainder of the infrastructure and restore business-critical services, CloudCard will provision new Elastic Beanstalk environments in the new availability zone using standard Amazon Machine Images (AMIs), which are ready to be started up at a moment’s notice. When starting recovery, instances from these AMIs come up quickly with their predefined role (for example, Web or App Server) within the deployment around the pilot light. From a networking point of view, Elastic Beanstalk automatically configures Elastic Load Balancing (ELB) to distribute traffic to multiple app servers. We will then update our Route 53 DNS records to point at our load balancers.
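The final DNS step can be sketched in Python. This is a minimal sketch under stated assumptions, not CloudCard's actual tooling: the record name, the load balancer hostname, and the helper name `build_dns_failover_change` are hypothetical. The function only builds the `ChangeBatch` structure accepted by the Route 53 `ChangeResourceRecordSets` API. Submitting it (for example through boto3's `change_resource_record_sets`) would also need credentials and the real hosted zone ID, neither of which is shown here.

```python
def build_dns_failover_change(record_name, load_balancer_dns, ttl=60):
    """Build a Route 53 ChangeBatch that repoints a CNAME at a new load balancer.

    The values passed in below are placeholders, not CloudCard's real records.
    """
    return {
        "Comment": "DR failover: repoint application DNS to the recovered load balancer",
        "Changes": [
            {
                # UPSERT creates the record, or overwrites it if it already exists.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    # A short TTL lets clients pick up the new target quickly.
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": load_balancer_dns}],
                },
            }
        ],
    }


# Hypothetical names, used only for illustration.
change_batch = build_dns_failover_change(
    "app.example.com.",
    "recovered-env.elb.example.com",
)
```

Building the change as plain data first makes it easy to review during a DR test before anything is sent to AWS.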
...
● The DR plan itself will be formally reviewed once every 12 months and in response to a regular test
...
IT service | Scenario | RPO | RTO | Priority |
---|---|---|---|---|
AWS | Amazon data center failure or destruction | <2 hours | <24 hours | Highest |
...
IT service | Backup location | Backup frequency |
---|---|---|
AWS Pilot Light | Separate AWS data center | Continually |
...
Version
...
Date
...
Revision details
...
The following individuals are to assume responsibility for restoring IT services when the DR plan is activated:
...
Name | Job role | Contact details | DR process owned |
---|---|---|---|
Anthony Erskine | Product Owner and IT lead | Phone: (434) 248-0444 Email: tony@onlinephotosubmission.com | Completion of the Pilot Light plan |
Luke Rettstatt | Managing Director | Phone: | |
Version 1.3
...
Quick Reference
If you believe a disaster has occurred that will affect our business, contact the Managing Director or the Principal Engineer.
...
Key contacts | CloudCard Support
Luke Rettstatt / Managing Director
|
Primary point of contact for all customer questions.
...
Anthony Erskine / Principal Engineer
| |
Key Locations | Primary Office: 1103 Wise Street Lynchburg, VA 24504 Online Meeting Room: https://onlinephotosubmission.com/meeting |
External contacts
Name | Organization | Contact details | DR process owned |
---|---|---|---|
Todd Brooks | Color ID | Phone: (704) 897-1959 Email: | Communicating status and process to current customer base. |
Zack Walker | Vision Database Systems | Phone: (561) 386-1534 | Communicating DR process to current customer base. |
...
Purpose
This document defines how CloudCard will respond to a disaster affecting our ability to serve our customers. The goal of the Disaster Recovery Plan is to restore services to the widest extent possible in the shortest possible time, while ensuring security and compliance are maintained.
Scope
A disaster for the purposes of this plan is defined as any event that causes prolonged unavailability of one or two AWS Availability Zones in CloudCard’s primary operating region.
The following events are excluded from the scope for this plan:
Loss of availability of the entire AWS region (large scale events of this sort will be responded to on a case-by-case basis).
Loss of availability of CloudCard’s offices (see Business Continuity Plan)
Loss of availability of a production application or service necessary to CloudCard’s operations that either (a) does not affect all of CloudCard’s core services, or (b) is short-lived (outage lasting less than 4 hours) (see Incident Response Plan)
Security breaches (see Incident Response Plan)
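The scope rules above can be expressed as a small decision function. This is an illustrative sketch only: the `Event` fields, the order in which the rules are checked, and the use of the 4-hour threshold as the test for a "prolonged" outage are assumptions made for the example. The final decision always rests with the Managing Director.

```python
from dataclasses import dataclass

DISASTER_RECOVERY = "Disaster Recovery Plan"
INCIDENT_RESPONSE = "Incident Response Plan"
BUSINESS_CONTINUITY = "Business Continuity Plan"
CASE_BY_CASE = "Case-by-case response"
ESCALATE = "Unclassified: Managing Director decides"

SHORT_OUTAGE_HOURS = 4  # outages shorter than this are treated as incidents


@dataclass
class Event:
    """Hypothetical description of an event; field names are illustrative."""
    unavailable_azs: int = 0
    region_unavailable: bool = False
    offices_unavailable: bool = False
    security_breach: bool = False
    affects_all_core_services: bool = False
    expected_outage_hours: float = 0.0


def classify(event):
    """Return the plan that the scope rules suggest for this event."""
    if event.security_breach:
        return INCIDENT_RESPONSE
    if event.region_unavailable:
        return CASE_BY_CASE
    if event.offices_unavailable and event.unavailable_azs == 0:
        return BUSINESS_CONTINUITY
    short_lived = event.expected_outage_hours < SHORT_OUTAGE_HOURS
    if short_lived or not event.affects_all_core_services:
        return INCIDENT_RESPONSE
    if event.unavailable_azs in (1, 2):
        return DISASTER_RECOVERY
    return ESCALATE


az_outage = Event(unavailable_azs=1, affects_all_core_services=True, expected_outage_hours=12)
plan = classify(az_outage)
```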
Policy
In the event of a disaster causing a major disruption to CloudCard’s production services, the person discovering the disaster must notify the Managing Director. The Managing Director will review the situation in consultation with the Principal Engineer and determine next steps. If the disaster falls within the scope above, the Managing Director should activate this Disaster Recovery Plan and follow the checklist appropriate for the given scenario (see Appendix: Scenarios).
Hard copies of this plan should be kept in each CloudCard office, as well as in the home office of each employee.
Review
This plan must be reviewed and tested annually and updated to address any issues identified. The plan must also be reviewed and updated after any activation of the plan to determine improvements for future disaster scenarios.
Activation
This Disaster Recovery plan is to be activated when one or more of the following criteria are met:
● An Amazon data center in which CloudCard stores its data is unavailable or is in imminent danger of becoming unavailable for an extended period of time.
The person discovering the potential disaster must notify the following DR stakeholders, who collectively assume responsibility for deciding which, if any, aspects of the DR plan should be implemented, and for establishing communication with employees, management, partners and customers.
● First point of contact - Anthony Erskine (contact details listed above)
● Second point of contact - Luke Rettstatt (contact details listed above)
...
In the event of severe damage to the AWS data center in which CloudCard stores its data, the following plan will be executed.
...
The person discovering the potential disaster must notify the Managing Director (contact details listed above). If the Managing Director is unavailable, the Principal Engineer must be notified instead.
Communications Processes
Once notified of a potential disaster, if the Managing Director activates this Plan, the Managing Director will direct the Principal Engineer and all relevant employees to convene in the CloudCard Meeting Room (https://onlinephotosubmission.com/meeting). This online meeting room will be used as the primary mechanism to coordinate action and internally communicate status updates.
If the CloudCard Meeting Room is unavailable, the Managing Director will arrange an alternate digital or physical meeting room and communicate the location to the Principal Engineer and all relevant employees.
Roles and Responsibilities
Person | Roles | Responsibilities |
---|---|---|
Managing Director | Coordination and Communication | Determine activation of plan; coordinate employee response; communicate status internally and externally; review and test plan annually |
Principal Engineer | Technical Execution; Alternate for Managing Director | Ensure all failovers complete smoothly; deploy new infrastructure to replace failed infrastructure where necessary; review and test plan annually |
Revision History
Document Revision History is maintained in Confluence. Changes to this document should be made with comments indicating the reason for the change. To do this, click the “Publish with Version Comment” option in the Confluence menu.
Change History |
---|
This document was imported to Confluence from Google Docs on 2/24/2023. Below is the history of the document from Google Docs:
Version 1.1 - October 2018 - initial Google Docs version
Version 1.1 - November 2019 - clarity and accuracy updates
Version 1.2 - March 2021 - updates to contact details.
Version 1.2 - February 8, 2023 - accuracy update.
Version 1.2 - February 24, 2023 - migrated to Confluence as v. 1 (see above).
Appendices
Appendix: Disaster Recovery Strategies
AWS Multi-site Active-Active Strategy
Application load is distributed across multiple resources located in two or more physical locations (AWS Availability Zones). If one Availability Zone becomes unavailable, resources are automatically or manually provisioned in the healthy Availability Zone to handle the load from the first zone.
Specific resources following these strategies:
RDS - core application database is provisioned on two servers: a primary server and a read replica, located in separate Availability Zones. In the event of a failure of the primary server, the database fails over to the read replica, which is promoted to become the primary server. The former primary server can be rebooted and recovered to become the read replica, or if it is completely out of commission, a new read replica can be instantiated.
In addition to active data, Aurora backs up the database automatically and continuously for a 7-day period. Aurora also takes a daily snapshot to provide further redundancy beyond the continuous backups.
Elastic Beanstalk - the application servers run in an auto scaling group distributed across more than one availability zone. If an entire availability zone were to become unavailable, the auto scaling logic would provision more servers in the other availability zone until the load from the users was met.
S3 (Simple Storage Service) - redundantly stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region, and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone. (https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html)
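The database failover described above can be modeled as a small simulation. This is a conceptual sketch only and does not call AWS: in practice the managed database service performs the promotion itself. The node names, Availability Zone labels, and the `fail_over` helper are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DbNode:
    name: str
    az: str
    role: str  # "primary" or "replica"


def fail_over(nodes, failed_az, spare_az):
    """Return the cluster layout after the loss of `failed_az`.

    A surviving replica is promoted if the primary was lost, and a new
    replica is added in `spare_az` if no replica remains.
    """
    survivors = [node for node in nodes if node.az != failed_az]
    if not survivors:
        raise RuntimeError("No surviving node; restore the database from backup instead")
    if not any(node.role == "primary" for node in survivors):
        # Promote the first surviving replica to primary.
        survivors[0] = replace(survivors[0], role="primary")
    if not any(node.role == "replica" for node in survivors):
        # Restore redundancy with a fresh replica in a healthy zone.
        survivors.append(DbNode("replacement-replica", spare_az, "replica"))
    return survivors


cluster = [DbNode("db-1", "az-a", "primary"), DbNode("db-2", "az-b", "replica")]
after = fail_over(cluster, failed_az="az-a", spare_az="az-c")
```

If both nodes sat in the failed zone, the function raises instead of guessing, which corresponds to the restore-from-backup path rather than a failover.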
Appendix: Scenarios
Single AZ failure
Plan of Action
Identify issue, coordinate initial response (Managing Director)
Contact Amazon to determine extent of unavailability (Managing Director)
Evaluate unavailability timeframe and impact (Managing Director)
Monitor service failover (Principal Engineer)
Ensure database failover occurs; add an additional read replica if needed.
Ensure auto scaling replaces lost services with new nodes.
Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.
Determine if any secondary services are down, and recover them.
Contact External Organizations to make them aware of the situation (Managing Director)
Determine service and data recovery timeframes (Principal Engineer)
Share timeframes with customers and external organizations (Managing Director)
Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
...
Key contacts
...
Anthony Erskine / Product Owner & Lead IT specialist
Phone: (434) 248-0444
Email: tony@onlinephotosubmission.com
Luke Rettstatt / Managing Director
Phone: (434) 253-5657
Email: luke@onlinephotosubmission.com
...
The appendices to your DR plan may include the following:
...
Address
...
Contact
...
915 11th Street
Lynchburg, VA 24504
...
Luke Rettstatt
...
1103 Wise Street
Lynchburg, VA 24504
...
Two AZ failure
Plan of Action
Identify issue, coordinate initial response (Managing Director)
Contact Amazon to determine extent of unavailability (Managing Director)
Evaluate unavailability timeframe and impact (Managing Director)
Instruct Principal Engineer to restore services from backups (Managing Director)
Contact External Organizations to make them aware of the situation (Managing Director)
Determine service and data recovery timeframes (Principal Engineer)
Share timeframes with customers and external organizations (Managing Director)
Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
Out of Scope Scenario Examples
Hurricane causes power outage in Lynchburg VA - Business Continuity
Snow makes it impossible for employees to commute - Business Continuity
Outage to our email and business application service (Google Apps) - Business Continuity
Entire Region down - Catastrophic event, handled on an as-needed basis
Employee deletes a large number of resources in AWS - Catastrophic event, handled on an as-needed basis
Employee deletes a single database server or other core resource - Incident
Appendix: Asset RTO and RPO
Priority | Asset | Scenario | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
1 | AWS data and services | Amazon data center failure or destruction | < 1 hour | < 1 hour |
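To make the targets concrete, the achieved RPO and RTO for an outage can be derived from three timestamps and compared with the table above. This is a minimal sketch: the function names and the example timestamps are hypothetical, and only the "< 1 hour" targets come from this plan.

```python
from datetime import datetime, timedelta

# Targets from the table above: both must be strictly under one hour.
RTO_TARGET = timedelta(hours=1)
RPO_TARGET = timedelta(hours=1)


def recovery_metrics(last_recoverable_data_at, outage_started_at, service_restored_at):
    """Return (achieved RPO, achieved RTO) for one outage.

    RPO is the window of data that could not be recovered; RTO is the
    time from the start of the outage until service was restored.
    """
    achieved_rpo = outage_started_at - last_recoverable_data_at
    achieved_rto = service_restored_at - outage_started_at
    return achieved_rpo, achieved_rto


def meets_targets(achieved_rpo, achieved_rto):
    return achieved_rpo < RPO_TARGET and achieved_rto < RTO_TARGET


# Hypothetical timestamps from a simulated outage.
rpo, rto = recovery_metrics(
    last_recoverable_data_at=datetime(2023, 3, 1, 9, 55),
    outage_started_at=datetime(2023, 3, 1, 10, 0),
    service_restored_at=datetime(2023, 3, 1, 10, 40),
)
ok = meets_targets(rpo, rto)
```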
Appendix: Test Plan
The Managing Director and Principal Engineer will meet with all other relevant employees for the following:
Read through the plan and address any questions.
For each of the scenarios defined in Appendix: Scenarios, craft an example of that scenario, and walk through how the plan would be implemented in that scenario. Document the estimated time taken for each action, including any failures to follow the plan that are discovered later in the conversation. For actions that can be simulated, note those actions for later simulation and continue the walkthrough.
Simulate the actions noted in step 2, and add the actual RPO and RTO achieved during these simulations to the walk-through notes. These actions should include (but are not limited to):
Test failover of database to another availability zone and adding a new read replica to the cluster.
Test scaling up the application cluster to introduce new servers in a different availability zone to replace others lost in the outage. Ensure that all availability zones in the region can be used by the cluster.
Test deploying a completely new database cluster from a database backup.
Test deploying a completely new application cluster.
Perform an after action review - collect all suggestions from all those included in the test for review.
Document the test results and after action review notes.
Update this Plan based on the results and suggestions.
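The walkthrough notes from the steps above could be recorded as simple structured data so that the estimated total can be compared with the RTO. This is an illustrative sketch: the durations are invented for the example, and it assumes the actions run one after another and that every action counts toward the RTO.

```python
from datetime import timedelta

RTO_TARGET = timedelta(hours=1)  # from Appendix: Asset RTO and RPO

# Hypothetical walkthrough notes: (action, owner, estimated duration).
walkthrough = [
    ("Identify issue, coordinate initial response", "Managing Director", timedelta(minutes=10)),
    ("Monitor service failover", "Principal Engineer", timedelta(minutes=20)),
    ("Determine service and data recovery timeframes", "Principal Engineer", timedelta(minutes=15)),
]


def total_estimate(notes):
    """Sum the estimated durations, assuming the actions are sequential."""
    return sum((duration for _, _, duration in notes), timedelta())


total = total_estimate(walkthrough)
within_target = total < RTO_TARGET
```

Once the simulations have been run, the estimates can be replaced with the measured durations and the same comparison repeated.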
Appendix: Planned Improvements
Use cross-region replication of data to withstand region-wide disasters (while ensuring data residency compliance). For our Canadian customers, this will be implemented as the AWS Canada West (Calgary) Region becomes available.
Update plan to address human error disasters.