Policy Owner: Managing Director

Effective Date: 2023-05-01

Quick Reference

If you become aware of an imminent or active disaster that affects CloudCard, contact the Managing Director or the Principal Engineer immediately, by phone call if possible.

If a disaster has occurred, please contact your supervisor or the Customer Support team as soon as possible to let them know if you are safe.

Key contacts

CloudCard Support

Luke Rettstatt / Managing Director

Anthony Erskine / Principal Engineer

Key Locations

Primary Office:

1103 Wise Street, Lynchburg, VA 24504

Online Meeting Room:

https://onlinephotosubmission.com/meeting

Purpose

CloudCard wants to maintain services and support for our customers, and our employees’ ability to work, as far as possible. When disasters affect our ability to deliver services, support customers, do our work, or stay safe, we need to be prepared to respond. The goal of this plan is to ensure the safety of our employees and restore services and operations to the greatest extent possible in the shortest possible time, while maintaining security and compliance. This plan provides guidance for responses to significant detrimental events, but is not intended to document daily problem resolution procedures.

Scope

All business critical IT Systems, especially those systems providing services to customers or facilitating customer support.

Any event that causes prolonged degradation of CloudCard services, or the inability of CloudCard employees to perform core business functions (customer support, operation of services, security of operations).

This policy applies to all employees of CloudCard and to all relevant external parties, including but not limited to CloudCard consultants and contractors.

Policy

In the event of a major disruption to production services, a disaster affecting the business critical systems used for CloudCard operations, or a disaster affecting the safety, security, or ability to work of a significant number of CloudCard’s employees, the Managing Director shall diagnose the situation and direct mitigating actions.

Appendix: Diagnostic Steps provides guidance on determining what is affected by the disaster.

Where possible, mitigating actions are prepared and described in scenario-specific action plans in Appendix: Scenarios. The Managing Director will follow these prepared action plans when appropriate. For situations that have not been preplanned, the Managing Director will coordinate mitigating actions in consultation with the Principal Engineer.

Hard copies of this plan should be kept in each CloudCard office, as well as the home office of all relevant employees.

The following factors are to be considered in planning mitigations:

  • Employee Safety

  • Continuity of information security

  • Continuity of compliance

  • Continuity of operations.

In the case of an information security event or incident, refer to the Incident Response Plan.

Review

This plan must be reviewed and tested annually and updated to address any issues identified. The plan must also be reviewed and updated after any activation of the plan to determine improvements for future disaster scenarios. The agenda for a testing exercise should be maintained in Appendix: Test Plan.

Activation

This Business Continuity and Disaster Recovery plan is to be activated when one or more of the following criteria are met:

An Amazon data center in which CloudCard stores its data is unavailable or is in imminent danger of becoming unavailable for an extended period of time.

A system supporting a core CloudCard business function is unavailable or is in imminent danger of becoming unavailable for an extended period of time.

A significant number of CloudCard employees are unable to work or in imminent danger of being unable to work for an extended period of time.

Examples of situations that would cause the above criteria to be met are:

  • loss of utility service (water, power, heating fuel)

  • loss of internet connectivity

  • catastrophic events (weather, natural disaster, vandalism)

The person discovering the potential or actual disaster must notify the Managing Director (contact details listed above). If the Managing Director is unavailable, the Principal Engineer must be notified instead.

Communications Processes

Once notified of a potential disaster, the Managing Director will consult with the Principal Engineer and follow appropriate diagnostic steps. Once the disaster has been diagnosed and the plan activated, the Managing Director will direct the Principal Engineer and all relevant employees to convene in the CloudCard Meeting Room (https://onlinephotosubmission.com/meeting). This online meeting room will be used as the primary mechanism to coordinate action and internally communicate status updates.

If the CloudCard Meeting Room is unavailable, the Managing Director will arrange an alternate digital or physical meeting room and communicate the location to the Principal Engineer and all relevant employees.

The Customer Support team will handle proactive communications to customers and resellers, and respond to questions from resellers and customers.

If employee safety is threatened, the first action should be to communicate with all employees to ensure their safety (see Appendix: Employee Safety Confirmation Process).

Alternate Work Facilities

If the CloudCard office becomes unavailable due to a disaster, all staff should work remotely from their homes or any safe location. Similarly, if an employee’s home office becomes unavailable or unsafe, the employee should first seek safety, and once safe, work from the office or find a safe alternate work location.

If necessary, the Managing Director should procure a temporary work location (e.g. coworking space membership) and accommodations in a location unaffected by the disaster so that affected employees can continue working.

All tools and processes used to conduct regular operations at CloudCard should be conducive to remote work to the greatest extent possible. For example, the use of web applications over encrypted channels is preferred to private server applications that require users to be on the network. Security controls should assume and account for remote work.

Continuity of Critical Services

Procedures for maintaining continuity of critical services in a disaster can be found in Appendix: Scenarios.

Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) can be found in Appendix: Asset RPO and RTO.

Strategy for maintaining continuity of services:

| Key Business Process | Continuity Strategy |
|---|---|
| Customer Service Delivery | Rely on AWS availability commitments and SLAs; use multi-site active-active, with cross-region backups where possible. |
| IT Operations | Use SaaS applications or AWS hosted applications to ensure operations do not depend on a single physical location and are conducive to remote work arrangements. |
| Email | Utilize Gmail and its distributed nature; rely on Google’s standard service level agreements. |
| Customer Support | All systems are vendor-hosted SaaS applications; use Gmail as the communications channel if the helpdesk is down. |
| Finance, Legal and HR | All systems are vendor-hosted SaaS applications. |
| Sales and Marketing | All systems are vendor-hosted SaaS applications. |

Roles and Responsibilities

Person: Managing Director

Roles: Coordination and Communication

Responsibilities:

  • Determine activation of plan

  • Coordinate employee response

  • Coordinate communication of status internally and externally

  • Prioritize activities to ensure safety, security, and core services are maintained or restored as soon as possible

  • Work with Customer Support team to ensure employee safety

  • Review and Test plan annually

  • Ensure pizza is provided

Person: Principal Engineer

Roles: Technical Execution; Alternate for Managing Director

Responsibilities:

  • Provide technical guidance on mitigating actions to the Managing Director

  • Ensure all failovers complete smoothly

  • Deploy new infrastructure to replace failed infrastructure where necessary.

  • Review and Test plan annually

  • Designate and brief alternate person in case of unavailability.

Person: Customer Support Team

Roles: Communication

Responsibilities:

  • Communicate with employees to ensure safety

  • Monitor employee safety status

  • Communicate status to customers and resellers

  • Handle questions from customers and resellers

Person: Engineering Team

Roles: Technical Execution

Responsibilities:

  • Support Principal Engineer as needed to recover services

Revision History

Note - prior to April 2023, CloudCard had a Business Continuity Plan, which was separate from the Disaster Recovery Plan. At the end of March 2023, we merged the two plans as part of our SOC 2 Compliance preparations.

| Version | Date | Description | Author | Approved by |
|---|---|---|---|---|
| 1.1 (Disaster Recovery Plan) | October 2018 | Initial Plan | | |
| 1.1 (Disaster Recovery Plan) | November 2019 | Clarity and Accuracy Updates | | |
| 1.2 (Disaster Recovery Plan) | March 2021 | Updates to Contact Details | | |
| 1.2 (Disaster Recovery Plan) | February 2023 | Accuracy Update | | |
| 1.3 (Disaster Recovery Plan) | March 2023 | Updated to reflect Active-Active AWS strategy | | |
| 1.4 (Disaster Recovery Plan) | March 2023 | Improved based on results of testing of plan | | |
| 2.0 (Business Continuity and Disaster Recovery Plan) | 2023-03-29 | Merged Disaster Recovery with Business Continuity | Ryan Heathcote | Luke Rettstatt |

Appendices

Appendix: Disaster Recovery Strategies

AWS Multi-site Active-Active Strategy

Application load is distributed across multiple resources located in two or more physical locations (AWS Availability Zones). If one Availability Zone becomes unavailable, resources are automatically or manually provisioned in the healthy Availability Zone to handle the load from the first zone.

Specific resources following these strategies:

  • RDS - core application database is provisioned on two servers: a primary server and a read replica, located in separate Availability Zones. Additionally, Aurora databases use a separate redundant storage layer independent of the servers that is distributed across all availability zones in a region. In the event of a failure of the primary server, the database fails over to the read replica, which is promoted to become the primary server. The former primary server can be rebooted and recovered to become the read replica, or if it is completely out of commission, a new read replica can be instantiated.

    • In addition to active data, Aurora backs up the database automatically and continuously for a 7-day period. Aurora also takes a daily snapshot to provide further redundancy beyond the continuous backups.

    • New servers can be established from the storage layer, typically in less than 30 minutes.

  • Elastic Beanstalk - the application servers run in an auto scaling group distributed across more than one availability zone. If an entire availability zone were to become unavailable, the auto scaling logic would provision more servers in the other availability zone(s) until the load from the users was met.

  • S3 (Simple Storage Service) - redundantly stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region, and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone. (from https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html )
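To illustrate the Elastic Beanstalk behaviour described above, the following minimal sketch simulates how application capacity might be redistributed when one Availability Zone is lost. It is a toy model and not AWS's actual auto scaling algorithm; the function name `rebalance`, the zone names, and the instance counts are assumptions made purely for illustration.

```python
def rebalance(instances_per_zone: dict[str, int], failed_zone: str) -> dict[str, int]:
    """Spread the failed zone's instances evenly over the remaining zones.

    Toy model only: real auto scaling reacts to health checks and load,
    not to a fixed instance count.
    """
    healthy = {z: n for z, n in instances_per_zone.items() if z != failed_zone}
    if not healthy:
        raise ValueError("no healthy Availability Zone left to absorb the load")

    lost = instances_per_zone.get(failed_zone, 0)
    share, remainder = divmod(lost, len(healthy))

    result = {}
    for i, zone in enumerate(sorted(healthy)):
        # The first `remainder` zones (in name order) take one extra instance.
        result[zone] = healthy[zone] + share + (1 if i < remainder else 0)
    return result


# Hypothetical starting layout: five instances over three zones.
before = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
after = rebalance(before, failed_zone="zone-a")
print(after)  # {'zone-b': 3, 'zone-c': 2}
```

Total capacity is preserved (five instances before and after), which is the property the walkthrough in Appendix: Test Plan would want to confirm against the real environment.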

Appendix: Diagnostic Steps

  1. If the disaster affects the Virginia region:

    1. consult news sources to gather information on impact of disaster

    2. If employee safety could be affected, immediately direct the Customer Support Team to follow the Appendix: Employee Safety Confirmation Process.

  2. Check the CloudCard Internal Status Dashboard (an internal site, known to relevant employees, that includes information on CloudCard system health and relevant AWS Status feeds)

    1. Determine if CloudCard systems are experiencing downtime

    2. Determine if AWS has published any notices.

  3. Attempt to log into CloudCard

  4. Attempt to log into the AWS console

  5. Observe the state of the database and application environments:

    1. Are the major components (autoscaling functionality, RDS cluster) still operational?

    2. Are autoscaling and failover functioning normally and recovering the services?

    3. Is the service recovery trending towards normal within 15 minutes?

Based on the evidence gained from the above diagnostic steps, the Managing Director will decide, in consultation with the Principal Engineer, if a disaster has occurred. If the disaster corresponds to one of the scenarios in the Appendix: Scenarios, the Managing Director will direct the execution of the given checklist. If the disaster does not correspond to a prepared scenario, the Managing Director will consult with the Principal Engineer to determine the appropriate plan of action.

If AWS has not acknowledged the disaster on their public site, consider submitting an AWS support ticket to notify AWS of the issue.
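Diagnostic step 5 asks whether service recovery is trending towards normal within 15 minutes. One possible way to make that judgement repeatable is a simple linear projection from two health readings, sketched below. The `healthy_fraction` metric, the function name, and the linear extrapolation are all assumptions for illustration; the plan itself does not prescribe how the trend is measured.

```python
RECOVERY_WINDOW_MINUTES = 15  # threshold taken from diagnostic step 5


def recovering_within_window(samples, normal_level=1.0,
                             window_minutes=RECOVERY_WINDOW_MINUTES):
    """Estimate whether service health will reach normal inside the window.

    samples: list of (minutes_since_outage_start, healthy_fraction) pairs in
    time order. healthy_fraction is a hypothetical stand-in for whatever the
    status dashboard reports.
    """
    if len(samples) < 2:
        return False  # not enough evidence to call it a recovery trend

    (t_first, h_first), (t_last, h_last) = samples[0], samples[-1]
    if h_last >= normal_level:
        return t_last <= window_minutes  # already recovered
    if t_last <= t_first:
        return False  # samples out of order or taken at the same instant

    slope = (h_last - h_first) / (t_last - t_first)  # health gained per minute
    if slope <= 0:
        return False  # flat or getting worse

    projected_recovery = t_last + (normal_level - h_last) / slope
    return projected_recovery <= window_minutes


# 40% healthy at minute 2, 70% at minute 6: projected to be normal near minute 10.
print(recovering_within_window([(2, 0.4), (6, 0.7)]))  # True
```

A result of `False` is only a prompt for the Managing Director and Principal Engineer to look closer; the decision to declare a disaster remains theirs.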

Appendix: Employee Safety Confirmation Process

When an event could affect employee safety, CloudCard will confirm the safety of employees using this process.

  1. The Managing Director, or the Customer Support team at the direction of the Managing Director, will post a message in the company-wide Marco Polo group, describing the situation and asking each employee to respond with their status.

  2. Employees will report back their status.

  3. The Customer Support team will monitor the group to ensure all employees report back.

  4. The Customer Support team will follow up with any employee who does not respond quickly via alternative communications channels.

  5. If an employee is not safe, CloudCard will attempt to provide that employee with resources (information or help) to assist in getting them to safety where possible, and continue to monitor the situation.
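The monitoring and follow-up steps above amount to comparing a roster against the replies received. A minimal sketch of that bookkeeping follows; the employee names, the `Status` values, and the function name are hypothetical, and in practice the replies would be read from the Marco Polo group by hand.

```python
from enum import Enum


class Status(Enum):
    SAFE = "safe"
    NOT_SAFE = "not safe"


def follow_up_lists(roster, responses):
    """Return (no_response, needs_help), each sorted by name.

    roster: set of employee names (hypothetical placeholders here).
    responses: dict mapping a name to the Status that person reported.
    """
    # Step 4: anyone on the roster who has not replied needs a follow-up.
    no_response = sorted(roster - set(responses))
    # Step 5: anyone who reported they are not safe needs help.
    needs_help = sorted(
        name for name, status in responses.items()
        if status is Status.NOT_SAFE and name in roster
    )
    return no_response, needs_help


roster = {"alice", "bob", "carol"}
responses = {"alice": Status.SAFE, "carol": Status.NOT_SAFE}
print(follow_up_lists(roster, responses))  # (['bob'], ['carol'])
```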

Appendix: Scenarios

CloudCard’s production operations are hosted on AWS services with auto scaling, multi-site redundancy, and incremental continuous backups. Therefore, our strategies for disasters in the cloud center on making sure failovers happen correctly and creating new application environments from data backups when necessary.

CloudCard uses SaaS / Cloud applications for most business critical functions so that all employees should be able to perform their work from any safe location that has power and a stable internet connection. Business continuity is therefore focused on (a) ensuring employee safety and (b) getting enough staff to a connected alternate work location to continue serving customers.

These scenarios also assume that there exists a safe travel channel for a minimal number of employees to take to reach a safe, internet-connected working location (if their home does not qualify), and that it is safe for the employee to leave their home. Employees should establish safety for themselves and their families / household prior to returning to work.

If a situation is so severe that it is not possible for even a minimal number of CloudCard staff to safely relocate to an internet-connected work location, we assume that it is immaterial for CloudCard to continue operations. An example is a natural disaster destroying power and network infrastructure across the entire Eastern and Central United States. In this situation, very few people will be connected to the internet at all, so CloudCard’s employees should focus on finding safety and taking care of others until power or network infrastructure is restored to a sufficient extent that CloudCard can resume continuity efforts according to one of the Scenarios below.

Disasters affecting AWS Availability Zone(s) or Individual Services

Plan of Action

  1. Assemble team in the appropriate meeting room (Managing Director)

  2. Pray

  3. Monitor service Failover and deploy backup infrastructure (Principal Engineer)

    1. Ensure database failover occurs; Add additional read replica if needed.

    2. Ensure auto scaling replaces lost services with new nodes.

      1. If the application scaling infrastructure is disabled, create a new application environment from backed up code artifacts.

    3. Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.

    4. Determine if any secondary services are down, and recover them.

  4. Determine service and data recovery timeframes (Principal Engineer)

  5. If service is likely to be degraded for more than 15 minutes, direct the Customer Support team to contact customers and resellers to make them aware of the situation (Managing Director)

    1. Update the service updates page and direct customers to review it for updates: https://onlinephotosubmission.com/service-updates

  6. Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)

Data / Infrastructure Sabotage or Human Error

In this scenario the assumption is that an attacker or an employee has intentionally or accidentally tampered with production resources to such an extent as to cause a major outage.

  1. Triage - determine the actor causing the sabotage

    1. If more appropriate, follow the Incident Response Plan.

  2. Perform containment to ensure further actor access or action is prevented.

  3. Follow the steps in Disasters affecting AWS Availability Zone(s) to restore services.

  4. Take appropriate legal, disciplinary or training action.

Outages of Business Critical Services

Google Workspaces Email unavailable:

  1. Review news from Google to determine timeline for restoration of services.

  2. Update the CloudCard service updates page to indicate outage.

  3. The Customer Support team communicates with customers via HelpScout.

  4. As a last resort, if service is unlikely to be restored for a significant period, and other email providers are fine, set up temporary or permanent operations on a different email provider and repoint DNS records.

HelpScout unavailable:

  1. Review news from HelpScout to determine timeline for restoration of services.

  2. Update the CloudCard service updates page to indicate outage.

  3. The Customer Support team communicates with customers via Google Workspaces Email.

  4. As a last resort, if service is unlikely to be restored for a significant period, and other help desk providers are fine, set up temporary or permanent operations on a different helpdesk provider.

SquareSpace unavailable:

  1. Review news from SquareSpace to determine timeline for restoration of services.

  2. Set up a simple static HTML site in S3 and repoint DNS for the website to the static site.

  3. The Customer Support team monitors HelpScout for customer questions.

  4. As a last resort, if service is unlikely to be restored for a significant period, and other website hosting providers are fine, set up temporary or permanent operations on a different website hosting provider, or rebuild the website on the S3 static site.

Sales / Accounting / Task Tracking unavailable:

These services do not affect CloudCard’s immediate ability to serve customers. If they become unavailable, staff should use spreadsheets or manual systems to track information until the system comes back online. If the outage is likely to be prolonged, CloudCard should seek another service provider.

Disasters affecting the CloudCard Office

Assumptions:

  • Employee home offices are unaffected by the disaster, safe to work from, and connected to the internet.

Plan of Action:

  1. Ensure employee safety. Evacuate the building or area if necessary.

  2. If safe to do so, employees at the office relocate to home offices.

  3. Verify internet connectivity at home offices.

  4. Remotely resume normal operations.

Disasters affecting the greater Lynchburg, VA area

Assumptions:

  • The CloudCard office is unavailable

  • Most employees’ home offices are affected by the disaster

  • Some locations within 1 hour driving time of Downtown Lynchburg are unaffected

  • At least some of the affected employees can safely commute to an unaffected location

Plan of Action:

  1. Ensure employee safety. Evacuate to a safe location if necessary.

  2. Managing Director finds and rents a coworking space or other working location within 1 hour drive of Downtown Lynchburg that is safe to work from, has sufficient internet connectivity, and has a safe commute for affected employees. If necessary, multiple such locations could be established.

  3. Employees work from the coworking space until their home office or the CloudCard office becomes available.

Disasters affecting most of Central Virginia

Assumptions:

  • The CloudCard office is unavailable

  • Most employees’ home offices are affected by the disaster

  • The entire region within at least 1 hour driving time of Downtown Lynchburg is affected by the disaster.

  • Some locations within 12 hours driving time of Downtown Lynchburg are unaffected

  • At least some employees can safely commute to an unaffected location.

Plan of Action:

  1. Ensure employee safety. Evacuate to a safe location if necessary.

  2. Managing Director finds and rents a coworking space or other working location, and appropriate hotel space, within an 8 hour drive of Downtown Lynchburg that is safe to work from, has sufficient internet connectivity, and has a safe commute for at least some employees.

  3. A minimal group of employees is coordinated to stay at the hotel and work at the coworking space.

    1. Where possible, a rotational model will be established so that employees are able to return to their families frequently and are not burned out.

Major Disasters affecting multiple states surrounding Virginia

Assumptions:

  • The CloudCard office is unavailable

  • Most employees’ home offices are affected by the disaster

  • The entire region within at least 12 hours driving time of Downtown Lynchburg is affected by the disaster.

  • It is therefore impossible to find a location within 12 hours driving time of Downtown Lynchburg that is safe to work from, connected to the internet, and that at least some employees can commute to

Response:

  • It is immaterial for CloudCard to focus on continuity efforts at this time.

  • The Managing Director should monitor the situation until a safe location becomes available

  • The Managing Director should maintain regular communications with employees to whatever degree possible to ensure their safety and arrange help where possible.
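The four location-based scenarios above form a ladder keyed on how far away the nearest safe, connected work location is. The sketch below shows one way to express that ladder; it uses the 1-hour and 12-hour thresholds from the scenario assumptions, and the function name and parameters are hypothetical. It is an aid for walkthroughs, not a substitute for the Managing Director's judgement.

```python
from typing import Optional


def select_scenario(home_offices_usable: bool,
                    nearest_safe_location_hours: Optional[float]) -> str:
    """Map the situation to the matching scenario heading in this appendix.

    nearest_safe_location_hours: driving time from Downtown Lynchburg to the
    nearest safe, internet-connected location; None if no such location is known.
    """
    if home_offices_usable:
        return "Disasters affecting the CloudCard Office"
    if nearest_safe_location_hours is None or nearest_safe_location_hours > 12:
        return "Major Disasters affecting multiple states surrounding Virginia"
    if nearest_safe_location_hours <= 1:
        return "Disasters affecting the greater Lynchburg, VA area"
    return "Disasters affecting most of Central Virginia"


print(select_scenario(False, 0.5))  # Disasters affecting the greater Lynchburg, VA area
print(select_scenario(False, 6))    # Disasters affecting most of Central Virginia
```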

Distributed Denial of Service (DDoS) Attacks

AWS provides base-level protection against DDoS and similar attacks. If an attack exceeds what the built-in AWS protection can handle, contact AWS Support for assistance in dealing with the situation.

Death or incapacitation of key leader

CloudCard’s key leaders are the Managing Director and the Principal Engineer.

CloudCard’s key leaders should each designate another employee as an alternate. The alternate should be briefed on the responsibilities of the given role and able to perform interim responsibilities in case of death or incapacitation, or planned absence of the key leader.

After a key leader takes time off from work, an after-action review should be performed to determine what gaps exist in the knowledge possessed by the alternate to perform interim key leader responsibilities.

All employee roles and responsibilities should be documented to enable other employees or new hires to assume responsibility.

Small Scale Events that are out of scope

The following are examples of events that are not large enough in scale to warrant activation of this plan.

  • Loss of connectivity for a single employee

  • Laptop failure for a single employee.

  • Loss of availability of a production application or service necessary to CloudCard’s operations that either (a) does not affect all of CloudCard’s core services, or (b) is short-lived (outage lasting less than 4 hours)
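The last exclusion above combines two conditions, which can be easy to misread. A minimal sketch of the rule as written follows; the function and parameter names are hypothetical, and the expected duration is an estimate supplied by whoever is triaging the outage.

```python
SHORT_LIVED_HOURS = 4  # outages shorter than this are treated as short-lived


def outage_is_out_of_scope(affects_all_core_services: bool,
                           expected_duration_hours: float) -> bool:
    """Return True when a production outage is too small to activate this plan.

    Out of scope if it (a) does not affect all core services, or
    (b) is expected to last less than SHORT_LIVED_HOURS.
    """
    return (not affects_all_core_services) or expected_duration_hours < SHORT_LIVED_HOURS


print(outage_is_out_of_scope(True, 6))   # False: all core services down for 6 hours
print(outage_is_out_of_scope(False, 6))  # True: only part of the core services affected
```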

Appendix: Asset RPO and RTO

| Asset | Scenario | Recovery Strategy | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
|---|---|---|---|---|
| AWS Data and Services | Amazon data center failure or destruction | Autoscaling, failover, or restoration of backups | < 1 hour | < 1 hour |
| Main Office | Major utility outage | Alternate work location | < 1 hour | < 1 hour |
| Employee Home Offices | Major utility outage | Alternate work location | < 12 hours | < 12 hours |
| Google Workspaces | Major service outage | Rely on Google SLAs | | |
| HelpScout | Major service outage | Use Gmail until service restored | | |
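When a test or a real event produces measured recovery figures, they can be compared against the objectives above. The following sketch shows one way to do that, assuming the measured values are recorded as durations; the `OBJECTIVES` mapping and function name are hypothetical, and assets without stated objectives (Google Workspaces, HelpScout) are deliberately left out.

```python
from datetime import timedelta

# (RTO, RPO) per asset, copied from the table above.
OBJECTIVES = {
    "AWS Data and Services": (timedelta(hours=1), timedelta(hours=1)),
    "Main Office": (timedelta(hours=1), timedelta(hours=1)),
    "Employee Home Offices": (timedelta(hours=12), timedelta(hours=12)),
}


def meets_objectives(asset: str, recovery_time: timedelta,
                     data_loss_window: timedelta) -> bool:
    """Return True if both measured values are strictly under the objectives.

    recovery_time: how long it took to restore the asset (compared to RTO).
    data_loss_window: how much recent data or work was lost (compared to RPO).
    """
    rto, rpo = OBJECTIVES[asset]
    return recovery_time < rto and data_loss_window < rpo


# Hypothetical test result: database restored in 25 minutes, 5 minutes of data lost.
print(meets_objectives("AWS Data and Services",
                       timedelta(minutes=25), timedelta(minutes=5)))  # True
```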

Appendix: Test Plan

The Managing Director and Principal Engineer will meet with all other relevant employees for the following:

  1. Read through the plan and address any questions.

  2. Test the employee safety confirmation process.

  3. For each of the scenarios defined in Appendix: Scenarios, craft an example of that scenario, and walk through how the plan would be implemented in that scenario. Document the estimated time taken for each action.

    1. For technical actions that can be simulated, note those actions for later simulation and continue the walkthrough.

  4. Simulate the actions noted during the walkthrough, and add the actual RPO and RTO achieved during these simulations to the walk-through notes. These actions should include (but are not limited to):

    1. Test failover of database to another availability zone and adding a new read replica to the cluster.

    2. Test scaling up the application cluster to introduce new servers in a different availability zone to replace others lost in the outage. Ensure that all availability zones in the region can be used by the cluster.

    3. Test deploying a completely new database cluster from a database backup.

    4. Test deploying a completely new application cluster.

  5. Perform an after-action review: collect all suggestions from all those included in the test for review.

  6. Document the test results and after action review notes.

  7. Update this Plan based on the results and suggestions.

Appendix: Planned Improvements

  • Use cross-region replication of data to withstand disasters affecting an entire region. This replication should ensure data residency compliance. For our Canadian customers, this will be implemented as the AWS Canada West (Calgary) Region becomes available.
