Policy Owner: Managing Director
Effective Date: 2023-05-01
Quick Reference
If you believe a disaster that will affect our business has occurred or is imminent, contact the Managing Director or the Principal Engineer as soon as possible, via phone call if possible.
If a disaster has occurred, please contact your supervisor or the Customer Support team as soon as possible to let them know whether or not you are safe and able to continue working. CloudCard wants to ensure the safety of our employees at all times, and does not want to assign tasks to employees who need time to ensure their own safety or the safety of others.
Key contacts | CloudCard Support
Luke Rettstatt / Managing Director
Anthony Erskine / Principal Engineer
|
Key Locations | Primary Office: 1103 Wise Street, Lynchburg, VA 24504 Online Meeting Room: |
Purpose
The plan defines how CloudCard will respond to situations that prevent normal operations and service delivery for prolonged periods of time. The goal of the plan is to restore services and operations to the greatest extent possible in the shortest possible time, while maintaining security and compliance.
Scope
All business critical IT Systems, especially those systems providing services to customers or facilitating customer support communications.
Any event that causes prolonged inability of CloudCard employees to perform core business functions (customer support, operation of services, security of operations). The goal of this plan is to guide and define responses to significant detrimental events, not to document daily problem resolution procedures.
This policy applies to all employees of CloudCard and to all relevant external parties, including but not limited to CloudCard consultants and contractors.
Policy
In the event of a major disruption to production services, or a disaster affecting either the business-critical systems used for CloudCard operations or the safety, security, or ability to work of a significant number of CloudCard’s employees, the Managing Director shall diagnose the issue and direct mitigating actions.
Appendix: Diagnostic Steps provides guidance on how to gather contextual information necessary to determine the appropriate mitigating actions to take.
Mitigating actions shall be preplanned where possible and described in scenario-specific action plans in Appendix: Scenarios. Certain large-scale disasters (e.g. loss of availability for an entire AWS Region) may not have a predefined action plan. For these disasters, the Managing Director will coordinate mitigating actions in consultation with the Principal Engineer, using as many predefined components of this policy as practical given the situation.
Hard copies of this plan should be kept in each CloudCard office, as well as the home office of all relevant employees.
Continuity of information security shall be considered along with operational continuity.
In the case of an information security event or incident, refer to the Incident Response Plan.
Review
This plan must be reviewed and tested annually and updated to address any issues identified. The plan must also be reviewed and updated after any activation of the plan to determine improvements for future disaster scenarios. The agenda for the testing exercise should be maintained in Appendix: Test Plan.
Activation
This Business Continuity and Disaster Recovery plan is to be activated when one or more of the following criteria are met:
An Amazon data center in which CloudCard stores its data is unavailable or is in imminent danger of becoming unavailable for an extended period of time.
A significant number of CloudCard employees are unable to work or in imminent danger of being unable to work for an extended period of time.
Examples of situations that would cause the above criteria to be met are:
loss of utility service (water, power, heating fuel)
loss of internet connectivity
catastrophic events (weather, natural disaster, vandalism)
The person discovering the potential or actual disaster must notify the Managing Director (contact details listed above). If the Managing Director is unavailable, the Principal Engineer must be notified instead.
Communications Processes
Once notified of a potential disaster, the Managing Director will consult with the Principal Engineer and follow appropriate diagnostic steps. Once the disaster has been diagnosed and the plan activated, the Managing Director will direct the Principal Engineer and all relevant employees to convene in the CloudCard Meeting Room (https://onlinephotosubmission.com/meeting). This online meeting room will be used as the primary mechanism to coordinate action and internally communicate status updates.
If the CloudCard Meeting Room is unavailable, the Managing Director will arrange an alternate digital or physical meeting room and communicate the location to the Principal Engineer and all relevant employees.
The Customer Support team will handle proactive communications to customers and resellers, and respond to questions from resellers and customers.
If employee safety is threatened, the first action should be to communicate with all employees to ensure their safety.
Alternate Work Facilities
If the CloudCard office becomes unavailable due to a disaster, all staff should work remotely from their homes or any safe location. Similarly, if an employee’s home office becomes unavailable or unsafe, the employee should first seek safety, and once safe, work from the office or find a safe alternate work location.
All tools and processes used to conduct regular operations at CloudCard should be conducive to remote work to the greatest extent possible. For example, the use of web applications over encrypted channels is preferred to private server applications that require users to be on the network. Security controls should assume and account for remote work.
Continuity of Critical Services
Procedures for maintaining continuity of critical services in a disaster can be found in Appendix: Scenarios.
Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) can be found in Appendix: Asset RPO and RTO.
Strategy for maintaining continuity of services:
KEY BUSINESS PROCESS | CONTINUITY STRATEGY |
Customer Service Delivery | Rely on AWS availability commitments and SLAs; use multi-site active-active, with cross-region backups where possible. |
IT Operations | Use SaaS applications or AWS hosted applications to ensure operations do not depend on a single physical location and are conducive to remote work arrangements. |
Utilize Gmail and its distributed nature, rely on Google’s standard service level agreements. | |
Customer Support | All systems are vendor-hosted SaaS applications; use Gmail as the communications channel if the helpdesk is down. |
Finance, Legal and HR | All systems are vendor-hosted SaaS applications. |
Sales and Marketing | All systems are vendor-hosted SaaS applications. |
Roles and Responsibilities
Person | Roles | Responsibilities |
---|---|---|
Managing Director | Coordination and Communication | Determine activation of plan; coordinate employee response; coordinate communication of status internally and externally; prioritize activities to ensure safety, security, and core services are maintained or restored as soon as possible; work with Customer Support team to ensure employee safety; review and test plan annually; ensure pizza is provided |
Principal Engineer | Technical Execution; Alternate for Managing Director | Provide technical guidance on mitigating actions to the Managing Director; ensure all failovers complete smoothly; deploy new infrastructure to replace failed infrastructure where necessary; review and test plan annually; designate and brief alternate person in case of unavailability |
Customer Support Team | Communication | Communicate with employees to ensure safety (if the disaster threatens employee safety); communicate status to customers and resellers; handle questions from customers and resellers |
Engineering Team | Technical Execution | Support Principal Engineer as needed to recover services |
Revision History
Note - prior to April 2023, CloudCard had a Business Continuity Plan, which was separate from the Disaster Recovery Plan. At the end of March 2023, we merged the two plans as part of our SOC 2 Compliance preparations.
Version | Date | Description | Author | Approved by |
1.1 (Disaster Recovery Plan) | October 2018 | Initial Plan | ||
1.1 (Disaster Recovery Plan) | November 2019 | Clarity and Accuracy Updates | ||
1.2 (Disaster Recovery Plan) | March 2021 | Updates to Contact Details | ||
1.2 (Disaster Recovery Plan) | February 2023 | Accuracy Update | ||
1.3 (Disaster Recovery Plan) | March 2023 | Updated to reflect Active-Active AWS strategy | ||
1.4 (Disaster Recovery Plan) | March 2023 | Improved based on results of testing of plan | ||
2.0 (Business Continuity and Disaster Recovery Plan) | 2023-03-29 | Merged Disaster Recovery with Business Continuity | Ryan Heathcote | Luke Rettstatt |
Appendices
Appendix: Disaster Recovery Strategies
AWS Multi-site Active-Active Strategy
Application load is distributed across multiple resources located in two or more physical locations (AWS Availability Zones). If one Availability Zone becomes unavailable, resources are automatically or manually provisioned in the healthy Availability Zone to handle the load from the first zone.
Specific resources following this strategy:
RDS - core application database is provisioned on two servers: a primary server and a read replica, located in separate Availability Zones. In the event of a failure of the primary server, the database fails over to the read replica, which is promoted to become the primary server. The former primary server can be rebooted and recovered to become the read replica, or if it is completely out of commission, a new read replica can be instantiated.
In addition to active data, Aurora backs up the database automatically and continuously for a 7-day period. Aurora also takes a daily snapshot to provide further redundancy beyond the continuous backups.
Elastic Beanstalk - the application servers run in an auto scaling group distributed across more than one availability zone. If an entire availability zone were to become unavailable, the auto scaling logic would provision more servers in the other availability zone until the load from the users was met.
S3 (Simple Storage Service) - redundantly stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region, and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone. (from https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html )
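The database failover described above can be illustrated with a toy model. The sketch below makes no AWS API calls and is not how Aurora is implemented; `DbInstance`, `fail_over`, and the zone labels are hypothetical names chosen for illustration. It assumes the failed primary is completely out of commission, so a new read replica is created instead of recovering the old server.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class DbInstance:
    name: str
    zone: str          # Availability Zone label (illustrative values only)
    role: str          # "primary" or "replica"
    healthy: bool = True


def fail_over(cluster: List[DbInstance], new_replica_zone: str) -> List[DbInstance]:
    """Promote a healthy replica if the primary has failed, then add a new replica."""
    primary = next(i for i in cluster if i.role == "primary")
    if primary.healthy:
        return list(cluster)  # primary is fine, nothing to do

    healthy_replicas = [i for i in cluster if i.role == "replica" and i.healthy]
    if not healthy_replicas:
        # No replica to promote: in the plan this means restoring from backups.
        raise RuntimeError("no healthy replica to promote; restore from backup")

    promoted = replace(healthy_replicas[0], role="primary")
    # Assumption: the old primary is unrecoverable, so a fresh replica takes its place.
    new_replica = DbInstance(f"{primary.name}-replacement", new_replica_zone, "replica")
    return [promoted, *healthy_replicas[1:], new_replica]


cluster = [
    DbInstance("db-a", "zone-1", "primary", healthy=False),
    DbInstance("db-b", "zone-2", "replica"),
]
recovered = fail_over(cluster, new_replica_zone="zone-3")
print([(i.name, i.zone, i.role) for i in recovered])
```

After the call, the former replica is the primary and a new replica exists in a different zone, which is the end state the strategy aims for.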
Appendix: Diagnostic Steps
If the disaster affects the Virginia region:
Consult news sources to gather information on the impact of the disaster.
If employee safety could be affected, immediately direct the customer support team to confirm the safety of each employee, so that employee safety status information is available as soon as possible.
Check the CloudCard Internal Status Dashboard (an internal site which will be known to relevant employees which includes information on CloudCard system health and relevant AWS Status feeds)
Determine if CloudCard systems are experiencing downtime
Determine if AWS has published any notices.
Attempt to log into CloudCard
Attempt to log into the AWS console
Observe the state of the database and application environments:
Are the major components (autoscaling functionality, RDS cluster) still operational?
Are autoscaling and failover functioning normally and recovering the services?
Is the service recovery trending towards normal within 15 minutes?
Based on the evidence gained from the above diagnostic steps, the Managing Director will decide, in consultation with the Principal Engineer, if a disaster has occurred. If the disaster corresponds to one of the scenarios in the Appendix: Scenarios, the Managing Director will direct the execution of the given checklist. If the disaster does not correspond to a prepared scenario, the Managing Director will consult with the Principal Engineer to determine the appropriate plan of action.
If AWS has not acknowledged the disaster on their public site, consider submitting an AWS support ticket to notify AWS of the issue.
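The three questions about the database and application environments can be recorded in a small structure so that the answers are captured consistently during an event. This is only a sketch: `Observations` and `automatic_recovery_sufficient` are hypothetical names, the recovery estimate is assumed to come from the Principal Engineer, and the decision itself remains with the Managing Director as described above.

```python
from dataclasses import dataclass

# From the checklist: recovery should trend towards normal within 15 minutes.
RECOVERY_WINDOW_MINUTES = 15


@dataclass
class Observations:
    components_operational: bool       # autoscaling functionality and RDS cluster still running?
    failover_recovering: bool          # are autoscaling and failover restoring the services?
    estimated_recovery_minutes: float  # hypothetical estimate supplied during diagnosis


def automatic_recovery_sufficient(obs: Observations) -> bool:
    """Return True only when all three checklist questions are answered favourably."""
    return (
        obs.components_operational
        and obs.failover_recovering
        and obs.estimated_recovery_minutes < RECOVERY_WINDOW_MINUTES
    )


print(automatic_recovery_sufficient(Observations(True, True, 10)))   # True
print(automatic_recovery_sufficient(Observations(True, False, 10)))  # False
```

A `False` result is a prompt to consider the scenario checklists, not an automatic activation of the plan.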
Appendix: Employee Safety Confirmation Process
When an event could affect employee safety, CloudCard will confirm the safety of employees using this process.
The Managing Director, or the Customer Support team at the direction of the Managing Director, will post a message in the company-wide Marco Polo group, describing the situation and asking each employee to respond with their status. The Customer Support team will monitor the group to ensure all employees report back. The Customer Support team will use alternative communications channels to follow up with any employee who does not respond quickly.
The purpose of this check is to make sure CloudCard is able to communicate with all employees and that all employees are safe. If an employee is not safe, CloudCard will attempt to provide that employee with resources (information or help) to assist in getting them to safety where possible, and continue to monitor the situation. CloudCard will also avoid assigning work to the employee until they are in a safe place and able to continue working.
Appendix: Scenarios
CloudCard uses SaaS / Cloud applications for most business critical functions so that all employees should be able to perform their work from any safe location that has power and a stable internet connection. Business continuity is therefore focused on (a) ensuring employee safety and (b) getting enough staff to a connected alternate work location to continue serving customers.
CloudCard’s production operations are hosted using AWS services with auto scaling, multi-site redundancy, and incremental continuous backups. Therefore our strategies for disasters in the cloud center around making sure failovers happen correctly, and creating new application environments from data backups when necessary.
These scenarios also assume that there exists a safe travel channel for a minimal number of employees to take to reach a safe, internet-connected working location, and that it is safe for the employee to leave their home. Employees should establish safety for themselves and their families / household prior to returning to work.
If a situation is so severe that there is no way for even a minimal number of CloudCard staff to safely relocate to an internet connected work location, we assume that it is immaterial for CloudCard to continue operations. For example, a natural disaster destroying power and network infrastructure across the entire Eastern and Central United States would undermine both CloudCard’s ability to continue operations and the meaningfulness of those operations. In such situations, very few people will be connected to the internet at all, and so CloudCard’s employees should focus on finding safety and taking care of others. Once power or network infrastructure is restored to a sufficient extent, CloudCard should be able to resume continuity efforts according to one of the Scenarios below.
Disasters affecting AWS Availability Zone(s)
Plan of Action
Assemble team in the appropriate meeting room (Managing Director)
Pray
Monitor service failover and deploy backup infrastructure (Principal Engineer)
Ensure database failover occurs; Add additional read replica if needed.
Ensure auto scaling replaces lost services with new nodes.
If the application scaling infrastructure is disabled, create a new application environment from backed up code artifacts.
Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.
Determine if any secondary services are down, and recover them.
Determine service and data recovery timeframes (Principal Engineer)
If service is likely to be degraded for more than 15 minutes, direct the Customer Support team to contact customers and resellers to make them aware of the situation (Managing Director)
Update the service updates page and direct customers to review it for updates: https://onlinephotosubmission.com/service-updates
Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
Data / Infrastructure Sabotage or Human Error
In this scenario the assumption is that an attacker or an employee has intentionally or accidentally tampered with production resources to such an extent as to cause a major outage.
Triage - determine the actor causing the sabotage
If more appropriate, follow the Incident Response Plan.
Perform containment to ensure further actor access or action is prevented.
Follow the steps in Disasters affecting AWS Availability Zone(s) to restore services.
Take appropriate legal, disciplinary or training action.
Outages of Business Critical Services
Google Workspaces Email unavailable:
Update the CloudCard service updates page to indicate outage.
Customer Support team communicate with customers via HelpScout.
Review news from Google to determine timeline for restoration of services.
As a last resort, if service is unlikely to be restored for a significant period, and other email providers are fine, set up temporary or permanent operations on a different email provider and repoint DNS records.
HelpScout unavailable:
Update the CloudCard service updates page to indicate outage.
Customer Support team communicate with customers via Google Workspaces Email.
Review news from HelpScout to determine timeline for restoration of services.
As a last resort, if service is unlikely to be restored for a significant period, and other help desk providers are fine, set up temporary or permanent operations on a different helpdesk provider.
SquareSpace unavailable:
Set up a simple static HTML site in S3 and repoint DNS for the website to the static site.
Customer Support team monitor HelpScout for customer questions.
Review news from SquareSpace to determine timeline for restoration of services.
As a last resort, if service is unlikely to be restored for a significant period, and other website hosting providers are fine, set up temporary or permanent operations on a different website hosting provider, or rebuild the website on the S3 static site.
Sales / Accounting / Task Tracking unavailable:
These services do not affect CloudCard’s immediate ability to serve customers. If they become unavailable, staff should use spreadsheets or manual systems to track information until the system comes back online, or establish an alternative provider.
AWS service (but not a whole AZ or region) unavailable:
Where possible, restore service in another region. Ensure cross-region replication of data where possible to facilitate recovery, as outages of a service are likely to be localized to a single region.
Update the CloudCard service updates page to indicate outage.
Customer Support team communicate with customers.
Review news from AWS to determine timeline for restoration of services.
Disasters affecting the CloudCard Office
This scenario assumes that employee home offices are unaffected by the disaster, safe to work from, and connected to the internet. An example is a fire in the office.
Ensure employee safety. Evacuate the building or area if necessary.
If safe to do so, Employees at the office relocate to home offices (30-60 minutes)
Verify internet connectivity at home offices (10 minutes)
Remotely resume normal operations
Disasters affecting the greater Lynchburg, VA area
This scenario assumes that both the CloudCard office and most employees’ home offices are affected by the disaster. Additionally, some locations within 1 hour driving time of Downtown Lynchburg are unaffected, and at least some of the affected employees can safely commute to an unaffected location.
Ensure employee safety. Evacuate to a safe location if necessary.
Managing Director finds and rents a coworking space or other working location within 1 hour drive of Downtown Lynchburg that is safe to work from, has sufficient internet connectivity, and has a safe commute for affected employees. If necessary, multiple such locations could be established.
Employees work from the coworking space until their home office or the CloudCard office becomes available.
Disasters affecting most of Central Virginia
This scenario assumes that both the CloudCard office and most employees’ home offices are affected by the disaster. Also, the entire region within at least 1 hour driving time of Downtown Lynchburg is affected by the disaster. But there are locations within 12 hours driving time of Downtown Lynchburg that are unaffected, and at least some employees can safely commute to an unaffected location.
Ensure employee safety. Evacuate to a safe location if necessary.
Managing Director finds and rents a coworking space or other working location, and appropriate hotel space, within an 8 hour drive of Downtown Lynchburg that is safe to work from, has sufficient internet connectivity, and has a safe commute for at least some employees.
A minimal group of employees is coordinated to stay at the hotel and work at the coworking space.
Where possible, a rotational model will be established so that employees are able to return to their families frequently and are not burned out.
Major Disasters affecting multiple states surrounding Virginia
If it is impossible to find a location within 12 hours driving time of Downtown Lynchburg that is safe to work from, connected to the internet, and that at least some employees can commute to, we assume that it is immaterial for CloudCard to focus on continuity efforts at this time. The Managing Director should monitor the situation until a safe location becomes available, and maintain regular communications with employees to whatever degree possible to ensure their safety and arrange help where possible.
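The four workplace scenarios above differ mainly in how far staff must travel to reach a safe, internet-connected location. The sketch below maps a situation to a scenario under that reading; `select_scenario` and its labels are hypothetical, and the 12-hour radius is taken from the scenario assumptions (the Central Virginia action step mentions an 8 hour drive, so treat the constant as adjustable).

```python
from typing import Optional

LOCAL_RADIUS_HOURS = 1      # "within 1 hour driving time of Downtown Lynchburg"
REGIONAL_RADIUS_HOURS = 12  # "within 12 hours driving time of Downtown Lynchburg"


def select_scenario(home_offices_affected: bool,
                    nearest_safe_location_hours: Optional[float]) -> str:
    """Map a situation to one of the workplace scenarios in this appendix.

    nearest_safe_location_hours is the driving time to the closest safe,
    internet-connected location, or None if no such location is known.
    """
    if not home_offices_affected:
        return "office"             # only the office is affected: work from home offices
    if nearest_safe_location_hours is None:
        return "multi-state"        # no safe location known: pause continuity efforts
    if nearest_safe_location_hours <= LOCAL_RADIUS_HOURS:
        return "greater-lynchburg"  # rent a nearby coworking space
    if nearest_safe_location_hours <= REGIONAL_RADIUS_HOURS:
        return "central-virginia"   # coworking space plus hotel, with rotation
    return "multi-state"


print(select_scenario(False, None))  # office
print(select_scenario(True, 0.5))    # greater-lynchburg
print(select_scenario(True, 6))      # central-virginia
print(select_scenario(True, 20))     # multi-state
```

Employee safety comes first in every branch; the function only suggests which checklist to open.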
Distributed Denial of Service (DDoS) Attacks
AWS provides base-level protection against DDoS and similar attacks. If an attack exceeds what the built-in AWS protection can handle, contact AWS support for assistance in dealing with the situation.
Death or incapacitation of key leader
CloudCard’s Managing Director and Principal Engineer, along with any other executive-level roles, should each designate another employee as their alternate. The alternate should be briefed on the responsibilities of the given role and able to perform interim responsibilities if the person in that role dies, is incapacitated, or is otherwise unable to work or advise for a prolonged period. The level of alternate briefing should be evaluated during executive vacation, and a debrief after each executive vacation should identify the areas of the alternate’s briefing that need to be improved.
Small Scale Events that are out of scope
The following are examples of events that are not large enough in scale to warrant activation of this plan.
Loss of connectivity for a single employee
Laptop failure for a single employee.
Loss of availability of a production application or service necessary to CloudCard’s operations that either (a) does not affect all of CloudCard’s core services, or (b) is short-lived (outage lasting less than 4 hours)
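The last exclusion combines two conditions, so a production outage is in scope only when it is both complete and long-lived. A minimal sketch of that logic follows; `production_outage_in_scope` is a hypothetical helper, and it covers only this one exclusion, not the full activation criteria.

```python
SHORT_OUTAGE_HOURS = 4  # outages lasting less than this are treated as short-lived


def production_outage_in_scope(affects_all_core_services: bool,
                               outage_hours: float) -> bool:
    """Return True only if the outage is neither partial nor short-lived."""
    return affects_all_core_services and outage_hours >= SHORT_OUTAGE_HOURS


print(production_outage_in_scope(True, 6))   # True
print(production_outage_in_scope(True, 2))   # False (short-lived)
print(production_outage_in_scope(False, 6))  # False (partial)
```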
Appendix: Asset RPO and RTO
Asset | Scenario | Recovery Strategy | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
AWS Data and Services | Amazon data center failure or destruction | Autoscaling, failover, or restoration of backups | < 1 hour | < 1 hour |
Main Office | Major utility Outage | Alternate work location | < 1 hour | < 1 hour |
Employee Home Offices | Major utility outage | Alternate work location | < 12 hours | < 12 hours |
Google Workspaces | Major service outage | Rely on Google SLAs | ||
HelpScout | Major service outage | Use Gmail until service restored |
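When test results are recorded, the measured recovery figures can be compared against the objectives in this table. The sketch below assumes measurements are expressed in hours; `OBJECTIVES_HOURS` and `meets_objectives` are hypothetical names. Google Workspaces and HelpScout are omitted because the table gives no figures for them.

```python
from typing import Dict, Tuple

# (RTO hours, RPO hours) copied from the table; both are strict upper bounds ("<").
OBJECTIVES_HOURS: Dict[str, Tuple[float, float]] = {
    "AWS Data and Services": (1, 1),
    "Main Office": (1, 1),
    "Employee Home Offices": (12, 12),
}


def meets_objectives(asset: str,
                     measured_rto_hours: float,
                     measured_rpo_hours: float) -> bool:
    """Compare recovery figures measured in a test against the table's objectives."""
    rto, rpo = OBJECTIVES_HOURS[asset]
    return measured_rto_hours < rto and measured_rpo_hours < rpo


print(meets_objectives("AWS Data and Services", 0.5, 0.1))  # True
print(meets_objectives("Employee Home Offices", 12, 1))     # False: 12 is not < 12
```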
Appendix: Test Plan
The Managing Director and Principal Engineer will meet with all other relevant employees for the following:
Read through the plan and address any questions.
Test the employee safety confirmation process.
For each of the scenarios defined in Appendix: Scenarios, craft an example of that scenario, and walk through how the plan would be implemented in that scenario. Document the estimated time taken for each action, including any failures to follow the plan that are discovered later in the conversation. For technical actions that can be simulated, note those actions for later simulation and continue the walkthrough.
Simulate the actions noted during the scenario walkthrough, and add the actual RPO and RTO achieved during these simulations to the walk-through notes. These actions should include (but are not limited to):
Test failover of database to another availability zone and adding a new read replica to the cluster.
Test scaling up the application cluster to introduce new servers in a different availability zone to replace others lost in the outage. Ensure that all availability zones in the region can be used by the cluster.
Test deploying a completely new database cluster from a database backup.
Test deploying a completely new application cluster.
Perform an after action review - collect all suggestions from all those included in the test for review.
Document the test results and after action review notes.
Update this Plan based on the results and suggestions.
Appendix: Planned Improvements
Use cross-region replication of data to withstand disasters affecting an entire region. This replication should ensure data residency compliance. For our Canadian customers, this will be implemented as the AWS Canada West (Calgary) Region becomes available.