...
Effective Date: 2023-05-01
TODO:
appendix - scenarios
Quick Reference
If you become aware of an imminent or active disaster that affects CloudCard, immediately contact the Managing Director, the Principal Engineer, or the Customer Support team, via phone call if possible.
If a disaster has occurred, please contact your supervisor or the Customer Support team as soon as possible to let them know whether you are safe and able to continue working. CloudCard wants to ensure the safety of our employees at all times, and does not want to assign tasks to employees who need time to ensure their own safety or the safety of others.
...
Key contacts
CloudCard Support
...
Luke Rettstatt / Managing Director
Phone: (434) 253-5657
Anthony Erskine / Principal Engineer
Phone: (434) 248-0444
Key Locations | Primary Office: 1103 Wise Street, Lynchburg, VA 24504 Online Meeting Room: |
CloudCard wants to maintain services and support for our customers, and the ability to work for our employees, as far as possible. When disasters affect our ability to deliver services, support customers, do our work, or stay safe, we need to be prepared to respond. This plan defines how CloudCard will respond to situations that prevent normal operations and service delivery for prolonged periods of time. The goal of this plan is to ensure the safety of our employees and restore services and operations to the greatest extent possible in the shortest possible time, while maintaining security and compliance.
...
This plan provides guidance for responses to significant detrimental events, but is not intended to document daily problem resolution procedures.
All business critical IT Systems, especially those systems providing services to customers or facilitating customer support communications.
Any event that causes prolonged degradation of CloudCard services, or the inability of CloudCard employees to perform core business functions (customer support, operation of services, security of operations).
This policy applies to all employees of CloudCard and to all relevant external parties, including but not limited to CloudCard consultants and contractors.
...
In the event of a major disruption to production services, a disaster affecting the business critical systems used for CloudCard operations, or a disaster affecting the safety, security, or ability to work of a significant number of CloudCard’s employees, the Managing Director shall diagnose the situation and direct mitigating actions.
Appendix: Diagnostic Steps provides guidance on how to gather the contextual information necessary to determine what is affected by the disaster and which mitigating actions to take.
Where possible, mitigating actions are prepared and described in scenario-specific action plans in Appendix: Scenarios. The Managing Director will follow these prepared action plans when appropriate. Certain large-scale disasters (e.g. loss of availability for an entire AWS Region) may not have a predefined action plan. For situations that have not been preplanned, the Managing Director will coordinate mitigating actions in consultation with the Principal Engineer, using as many predefined components of this policy as practical given the situation.
Hard copies of this plan should be kept in each CloudCard office, as well as the home office of all relevant employees.
The following factors are to be considered in planning mitigations:
Employee Safety
Continuity of information security
...
Continuity of compliance
Continuity of operations.
In the case of an information security event or incident, refer to the Incident Response Plan.
...
This plan must be reviewed and tested annually and updated to address any issues identified. The plan must also be reviewed and updated after any activation of the plan to determine improvements for future disaster scenarios. The agenda for a testing exercise should be maintained in Appendix: Test Plan.
Activation
This Business Continuity and Disaster Recovery plan is to be activated when one or more of the following criteria are met:
An Amazon data center in which CloudCard stores its data is unavailable or is in imminent danger of becoming unavailable for an extended period of time.
A system supporting a core CloudCard business function is unavailable or is in imminent danger of becoming unavailable for an extended period of time.
A significant number of CloudCard employees are unable to work or in imminent danger of being unable to work for an extended period of time.
...
If employee safety is threatened, the first action should be to communicate with all employees to ensure their safety (see Appendix: Employee Safety Confirmation Process).
If the CloudCard office becomes unavailable due to a disaster, all staff should work remotely from their homes or any safe location. Similarly, if an employee’s home office becomes unavailable or unsafe, the employee should first seek safety, and once safe, work from the office or find a safe alternate work location.
If necessary, the Managing Director should procure a temporary work location (e.g. coworking space membership) and accommodations in a location unaffected by the disaster so that affected employees can continue working.
All tools and processes used to conduct regular operations at CloudCard should be conducive to remote work to the greatest extent possible. For example, the use of web applications over encrypted channels is preferred to private server applications that require users to be on the network. Security controls should assume and account for remote work.
...
Strategy for maintaining continuity of services:
KEY BUSINESS PROCESS | CONTINUITY STRATEGY |
---|---|
Customer Service Delivery | Rely on AWS availability commitments and SLAs; use multi-site active-active, with cross-region backups where possible. |
IT Operations | Use SaaS applications or AWS hosted applications to ensure operations do not depend on a single physical location and are conducive to remote work arrangements. Utilize Gmail and its distributed nature; rely on Google’s standard service level agreements. |
Customer Support | All systems are vendor-hosted SaaS applications; use Gmail as the communications channel if the helpdesk is down. |
Finance, Legal and HR | All systems are vendor-hosted SaaS applications. |
Sales and Marketing | All systems are vendor-hosted SaaS applications. |
Person | Roles | Responsibilities |
---|---|---|
Managing Director | Coordination and Communication | Determine activation of plan; coordinate employee response; coordinate communication of status internally and externally; prioritize activities to ensure safety, security, and core services are maintained or restored as soon as possible; work with Customer Support team to ensure employee safety; review and test plan annually; ensure pizza is provided |
Principal Engineer | Technical Execution; Alternate for Managing Director | Provide technical guidance on mitigating actions to the Managing Director; ensure all failovers complete smoothly; deploy new infrastructure to replace failed infrastructure where necessary; review and test plan annually; designate and brief alternate person in case of unavailability |
Customer Support Team | Communication | Communicate with employees to ensure safety; monitor employee safety status; communicate status to customers and resellers; handle questions from customers and resellers |
Engineering Team | Technical Execution | Support Principal Engineer as needed to recover services |
Revision History
...
Version | Date | Description | Author | Approved by |
1.1 (Disaster Recovery Plan) | October 2018 | Initial Plan | ||
1.1 (Disaster Recovery Plan) | November 2019 | Clarity and Accuracy Updates | ||
1.2 (Disaster Recovery Plan) | March 2021 | Updates to Contact Details | ||
1.2 (Disaster Recovery Plan) | February 2023 | Accuracy Update | ||
1.3 (Disaster Recovery Plan) | March 2023 | Updated to reflect Active-Active AWS strategy | ||
1.4 (Disaster Recovery Plan) | March 2023 | Improved based on results of testing of plan | ||
2.0 (Business Continuity and Disaster Recovery Plan)* | 2023-03-29 | Merged Disaster Recovery with Business Continuity | Ryan Heathcote | Luke Rettstatt |
...
2.1 | 2024-07-20 | Updates from review of annual test | Ryan Heathcote | Luke Rettstatt |
* Note - prior to April 2023, CloudCard had a Business Continuity Plan, which was separate from the Disaster Recovery Plan. At the end of March 2023, we merged the two plans as part of our SOC 2 Compliance preparations.
Appendix: Disaster Recovery Strategies
...
RDS - the core application database is provisioned on two servers: a primary server and a read replica, located in separate Availability Zones. Additionally, Aurora databases use a separate redundant storage layer, independent of the servers, that is distributed across all availability zones in a region. In the event of a failure of the primary server, the database fails over to the read replica, which is promoted to become the primary server. The former primary server can be rebooted and recovered to become the read replica, or if it is completely out of commission, a new read replica can be instantiated.
In addition to active data, Aurora backs up the database automatically and continuously for a 7 day period. Additionally, Aurora takes a daily snapshot to ensure further redundancy above the continuous backups.
New servers can be established from the storage layer, typically in less than 30 minutes.
Elastic Beanstalk - the application servers run in an auto scaling group distributed across more than one availability zone. If an entire availability zone were to become unavailable, the auto scaling logic would provision more servers in the other availability zone(s) until the load from the users was met.
S3 (Simple Storage Service) - redundantly stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region, and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone. (from https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html )
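The database failover described above can be illustrated with a small sketch. It assumes the cluster description has the shape of one entry in the `DBClusters` list returned by boto3's `describe_db_clusters` (a `DBClusterMembers` list with `DBInstanceIdentifier`, `IsClusterWriter`, and `PromotionTier`). The instance names are hypothetical, and the name-based tie-break is an assumption of this sketch rather than Aurora's own rule. The function makes no AWS calls; it only shows which replica would be a candidate for promotion, or that a new replica is needed.

```python
from typing import Optional


def pick_failover_target(cluster: dict) -> Optional[str]:
    """Return the reader instance to promote, or None if no reader exists.

    `cluster` is assumed to look like one entry of the `DBClusters` list in a
    boto3 `describe_db_clusters` response.
    """
    readers = [
        member
        for member in cluster.get("DBClusterMembers", [])
        if not member.get("IsClusterWriter", False)
    ]
    if not readers:
        # No replica is left, so a new read replica must be created first.
        return None
    # A lower promotion tier means higher failover priority.
    # Ties are broken by name here purely to keep the sketch deterministic.
    best = min(
        readers,
        key=lambda m: (m.get("PromotionTier", 15), m["DBInstanceIdentifier"]),
    )
    return best["DBInstanceIdentifier"]


# Hypothetical cluster: one writer and one reader in a separate Availability Zone.
example_cluster = {
    "DBClusterIdentifier": "example-cluster",
    "DBClusterMembers": [
        {"DBInstanceIdentifier": "example-primary", "IsClusterWriter": True, "PromotionTier": 1},
        {"DBInstanceIdentifier": "example-replica", "IsClusterWriter": False, "PromotionTier": 1},
    ],
}
print(pick_failover_target(example_cluster))
```

In a real test exercise, the returned identifier could be checked against the instance that actually became the writer after the failover.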
...
If the disaster affects the Virginia region:
consult news sources to gather information on impact of disaster
If employee safety could be affected, immediately direct the Customer Support Team to follow the Appendix: Employee Safety Confirmation Process, so that employee safety status information is available as soon as possible.
Attempt to log into CloudCard
Check the CloudCard Internal Status Dashboard (an internal site which will be known to relevant employees which includes information on CloudCard system health and relevant AWS Status feeds)
Determine if CloudCard systems are experiencing downtime
Determine if AWS has published any notices.
Attempt to log into the AWS console
Observe the state of the database and application environments:
Are the major components (autoscaling functionality, RDS cluster) still operational?
Are autoscaling and failover functioning normally and recovering the services?
Is the service recovery trending towards normal in less than 15 minutes?
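The diagnostic questions above can be read as a simple decision rule. The following is a minimal sketch under stated assumptions: the field names, the three outcome labels, and the mapping from observations to outcomes are hypothetical and are not part of this plan. Only the 15 minute recovery window comes from the steps above. The Managing Director still makes the actual decision.

```python
from dataclasses import dataclass
from typing import Optional

RECOVERY_WINDOW_MINUTES = 15  # threshold taken from the diagnostic steps


@dataclass
class DiagnosticSnapshot:
    """Observations gathered during diagnosis (field names are hypothetical)."""
    cloudcard_login_ok: bool
    aws_console_login_ok: bool
    autoscaling_operational: bool
    rds_cluster_operational: bool
    # Estimated minutes until service is back to normal; None if not recovering.
    minutes_to_expected_recovery: Optional[float] = None


def recommend_action(snapshot: DiagnosticSnapshot) -> str:
    """Map a snapshot to one of three illustrative outcomes."""
    components_ok = (
        snapshot.autoscaling_operational and snapshot.rds_cluster_operational
    )
    if snapshot.cloudcard_login_ok and components_ok:
        return "no_outage_detected"
    recovering_in_time = (
        snapshot.minutes_to_expected_recovery is not None
        and snapshot.minutes_to_expected_recovery < RECOVERY_WINDOW_MINUTES
    )
    if snapshot.aws_console_login_ok and components_ok and recovering_in_time:
        # Autoscaling and failover appear to be doing their job.
        return "monitor_automatic_recovery"
    return "begin_manual_recovery"


# Example: CloudCard login fails, but AWS components are healthy and recovering.
snapshot = DiagnosticSnapshot(False, True, True, True, minutes_to_expected_recovery=8)
print(recommend_action(snapshot))
```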
...
When an event could affect employee safety, CloudCard will confirm the safety of employees using this process.
The Managing Director, or the Customer Support team at the direction of the Managing Director, will post a message in the company-wide Marco Polo group, describing the situation and asking each employee to respond with their status.
Employees will report back their status.
The Customer Support team will monitor the group to ensure all employees report back.
The Customer Support team will use alternative communications channels to follow up with any employee who does not respond quickly.
...
If an employee is not safe, CloudCard will attempt to provide that employee with resources (information or help) to assist in getting them to safety where possible, and continue to monitor the situation
...
.
CloudCard uses SaaS / Cloud applications for most business critical functions, so all employees should be able to perform their work from any safe location that has power and a stable internet connection. Business continuity is therefore focused on (a) ensuring employee safety and (b) getting enough staff to a connected alternate work location to continue serving customers.
CloudCard’s production operations are hosted using AWS services with auto scaling, multi-site redundancy, and incremental continuous backups. Therefore our strategies for disasters in the cloud center around making sure failovers happen correctly, and creating new application environments from data backups when necessary.
These scenarios also assume that there exists a safe travel channel for a minimal number of employees to reach a safe, internet-connected working location (if their home does not qualify), and that it is safe for the employee to leave their home. Employees should establish safety for themselves and their families / household prior to returning to work.
If a situation is so severe that it is not possible for even a minimal number of CloudCard staff to safely relocate to an internet-connected work location, we assume that it is immaterial for CloudCard to continue operations. For example, a natural disaster destroying power and network infrastructure across the entire Eastern and Central United States would undermine both CloudCard’s ability to continue operations and the meaningfulness of those operations. In this situation, very few people will be connected to the internet at all, so CloudCard’s employees should focus on finding safety and taking care of others until power or network infrastructure is restored to a sufficient extent that CloudCard can resume continuity efforts according to one of the Scenarios below.
Plan of Action
Assemble team in the appropriate meeting room (Managing Director)
Pray
Monitor service failover and deploy backup infrastructure (Principal Engineer)
Ensure database failover occurs; Add additional read replica if needed.
Ensure auto scaling replaces lost services with new nodes.
If the application scaling infrastructure is disabled, create a new application environment from backed up code artifacts.
Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.
Determine if any secondary services are down, and recover them.
Determine service and data recovery timeframes (Principal Engineer)
If service is likely to be degraded for more than 15 minutes, direct the Communications Team to contact customers and resellers to make them aware of the situation (Managing Director)
Update the service updates page and direct customers to review it for updates: https://onlinephotosubmission.com/service-updates
Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
Outages of Business Critical Services
Google Workspaces Email unavailable:
...
Update the CloudCard service updates page to indicate outage.
...
Customer Support team communicate with customers via HelpScout.
...
Data / Infrastructure Sabotage or Human Error
In this scenario the assumption is that an attacker or an employee has intentionally or accidentally tampered with production resources to such an extent as to cause a major outage.
Triage - determine the actor causing the sabotage
If more appropriate, follow the Incident Response Plan.
Perform containment to ensure further actor access or action is prevented.
Follow the steps in Disasters affecting AWS Availability Zone(s) to restore services.
Take appropriate legal, disciplinary or training action.
Outages of Business Critical Services
Google Workspaces Email unavailable:
Review news from Google to determine timeline for restoration of services.
Update the CloudCard service updates page to indicate outage.
Customer Support team communicate with customers via Active Campaign.
As a last resort, if service is unlikely to be restored for a significant period, and other email providers are fine, set up temporary or permanent operations on a different email provider (Microsoft Office 365) and repoint DNS records.
HelpScout unavailable:
Review news from HelpScout to determine timeline for restoration of services.
Update the CloudCard service updates page to indicate outage.
Customer Support team communicate with customers via Google Workspaces Email.
Log into the Google Support account and review emails with the Support label.
As a last resort, if service is unlikely to be restored for a significant period, and other help desk providers are fine, set up temporary or permanent operations on a different helpdesk provider.
SquareSpace unavailable:
Set up a simple HTML static site in S3 and repoint DNS for the website to the static site.
Customer Support team monitor HelpScout for customer questions.
Review news from SquareSpace to determine timeline for restoration of services.
As a last resort, if service is unlikely to be restored for a significant period, and other website hosting providers are fine, set up temporary or permanent operations on a different website hosting provider, or rebuild the website on the S3 static site.
Sales / Accounting / Task Tracking unavailable:
These services do not affect CloudCard’s immediate ability to serve customers. If they become unavailable, staff should use spreadsheets or manual systems to track information until the system comes back online, or establish an alternative provider.
AWS service (but not a whole AZ or region) unavailable:
Where possible, restore service in another region. Ensure cross-region replication of data where possible to facilitate recovery, as outages of a service are likely to be localized to a single region.
Update the CloudCard service updates page to indicate outage.
Customer Support team communicate with customers.
Review news from AWS to determine timeline for restoration of services.
Disasters affecting the CloudCard Office
This scenario assumes that Employee home offices are unaffected by the disaster and safe to work from and connected to the internet. For example, a fire in the office.
Ensure employee safety. Evacuate the building or area if necessary.
If safe to do so, Employees at the office relocate to home offices (30-60 minutes)
Verify internet connectivity at home offices (10 minutes)
Remotely resume normal operations
Disasters affecting the greater Lynchburg, VA area
...
Sales / Accounting / Task Tracking / SquareSpace unavailable:
These services do not affect CloudCard’s immediate ability to serve customers. If they become unavailable, staff should use spreadsheets or manual systems to track information until the system comes back online. If the outage is likely to be prolonged, CloudCard should seek another service provider.
Disasters affecting the CloudCard Office
Assumptions:
Employee home offices are unaffected by the disaster, safe to work from, and connected to the internet.
Plan of Action:
Ensure employee safety. Evacuate the building or area if necessary.
If safe to do so, employees at the office relocate to home offices.
Verify internet connectivity at home offices.
Remotely resume normal operations.
Disasters affecting the greater Lynchburg, VA area
Assumptions:
The CloudCard office is unavailable
Most employees' home offices are affected by the disaster
Some locations within 1 hour driving time of Downtown Lynchburg are unaffected
At least some of the affected employees can safely commute to an unaffected location
Plan of Action:
Ensure employee safety. Evacuate to a safe location if necessary.
Managing Director finds and rents a coworking space or other working location within 1 hour drive of Downtown Lynchburg that is safe to work from, has sufficient internet connectivity, and has a safe commute for affected employees. If necessary, multiple such locations could be established.
Employees work from the coworking space until their home office or the CloudCard office becomes available.
Disasters affecting most of Central Virginia
Assumptions:
The CloudCard office is unavailable
Most employees' home offices are affected by the disaster
The entire region within at least 1 hour driving time of Downtown Lynchburg is affected by the disaster.
Some locations within 8 hours driving time of Downtown Lynchburg are unaffected
At least some employees can safely commute to an unaffected location.
Plan of Action:
Ensure employee safety. Evacuate to a safe location if necessary.
Managing Director finds and rents a coworking space or other working location, and appropriate hotel space, within an 8 hour drive of Downtown Lynchburg that is safe to work from, has sufficient internet connectivity, and has a safe commute for at least some employees.
A minimal group of employees is coordinated to stay at the hotel and work at the coworking space.
Where possible, a rotational model will be established so that employees are able to return to their families frequently and are not burned out.
Major Disasters affecting multiple states surrounding Virginia
Assumptions:
The CloudCard office is unavailable
Most employees' home offices are affected by the disaster
The entire region within at least 8 hours driving time of Downtown Lynchburg is affected by the disaster.
It is therefore impossible to find a location within 8 hours driving time of Downtown Lynchburg that is safe to work from, connected to the internet, and that at least some employees can commute to.
Response:
It is immaterial for CloudCard to focus on continuity efforts at this time.
The Managing Director should monitor the situation until a safe location becomes available
The Managing Director should maintain regular communications with employees to whatever degree possible to ensure their safety and arrange help where possible.
Distributed Denial of Service (DDoS) Attacks
AWS provides base level protection against DDoS and similar attacks. If an attack exceeds what the built-in AWS protection can absorb, contact AWS support for assistance in dealing with the situation.
Death or incapacitation of key leader
CloudCard’s key leaders are the Managing Director and the Principal Engineer.
CloudCard’s key leaders should each designate another employee as an alternate. The alternate should be briefed on the responsibilities of the given role and able to perform interim responsibilities in case of death or incapacitation, or planned absence of the key leader.
After a key leader takes time off from work, an after-action review should be performed to determine what gaps exist in the knowledge possessed by the alternate to perform interim key leader responsibilities.
All employee roles and responsibilities should be documented to enable other employees or new hires to assume responsibility.
Small Scale Events that are out of scope
...
Loss of connectivity for a single employee
Laptop failure for a single employee.
Loss of availability of a production application or service necessary to CloudCard’s operations that either (a) does not affect all of CloudCard’s core services, or (b) is short-lived (outage lasting less than 4 hours) (see Incident Response Plan)
Asset | Scenario | Recovery Strategy | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
---|---|---|---|---|
AWS Data and Services | Amazon data center failure or destruction | Autoscaling, failover, or restoration of backups | < 1 hour | < 1 hour |
Main Office | Major utility outage | Alternate work location | < 1 hour | < 1 hour |
Employee Home Offices | Major utility outage | Alternate work location | < 12 hours | < 12 hours |
Google Workspaces | Major service outage | Rely on Google SLAs | | |
HelpScout | Major service outage | Use Gmail until service restored | | |
...
Read through the plan and address any questions.
Test the employee safety confirmation process.
For each of the scenarios defined in Appendix: Scenarios, craft an example of that scenario, and walk through how the plan would be implemented in that scenario. Document the estimated time taken for each action, including any failures to follow the plan that are discovered later in the conversation.
For technical actions that can be simulated, note those actions for later simulation and continue the walkthrough.
Simulate the actions noted during the walkthrough, and add the actual RPO and RTO achieved during these simulations to the walk-through notes. These actions should include (but are not limited to):
Test failover of database to another availability zone and adding a new read replica to the cluster.
Test scaling up the application cluster to introduce new servers in a different availability zone to replace others lost in the outage. Ensure that all availability zones in the region can be used by the cluster.
Test deploying a completely new database cluster from a database backup.
Test deploying a completely new application cluster.
Perform an after action review - collect all suggestions from all those included in the test for review.
Document the test results and after action review notes.
Update this Plan based on the results and suggestions.
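As an illustration of recording the actual RTO and RPO from the simulations, the sketch below compares hypothetical measurements against the "< 1 hour" objectives listed for AWS Data and Services in the recovery objectives table. The class name, function name, and all measured values are assumptions made up for this example. They are not results of any real test.

```python
from dataclasses import dataclass
from datetime import timedelta

# Objectives for "AWS Data and Services" from the recovery objectives table.
RTO_OBJECTIVE = timedelta(hours=1)
RPO_OBJECTIVE = timedelta(hours=1)


@dataclass
class SimulationResult:
    """One simulated recovery action and the times measured for it."""
    action: str
    actual_rto: timedelta  # time until service was restored
    actual_rpo: timedelta  # window of data that could not be recovered


def missed_objectives(results):
    """Return (action, objective) pairs where the measured value was not under the objective."""
    missed = []
    for result in results:
        if result.actual_rto >= RTO_OBJECTIVE:
            missed.append((result.action, "RTO"))
        if result.actual_rpo >= RPO_OBJECTIVE:
            missed.append((result.action, "RPO"))
    return missed


# Hypothetical measurements, for illustration only.
results = [
    SimulationResult("database failover", timedelta(minutes=3), timedelta(0)),
    SimulationResult(
        "new database cluster from backup",
        timedelta(minutes=70),
        timedelta(minutes=5),
    ),
]
print(missed_objectives(results))
```

Any pair reported by a check like this would be a natural item for the after action review.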
...