
...

Quick Reference

If you believe a disaster has occurred that will affect our business, contact the Managing Director or the Principal Engineer.

Key contacts

CloudCard Support

  • Phone: (434) 253-5657

  • Email: support@onlinephotosubmission.com

Luke Rettstatt / Managing Director

Anthony Erskine / Principal Engineer

Key Locations

Primary Office:

1103 Wise Street, Lynchburg, VA 24504

Online Meeting Room:

https://onlinephotosubmission.com/meeting

External contacts

Name: Todd Brooks
Organization: ColorID
Contact details: Phone: (704) 897-1959; Email: Todd.Brooks@ColorID.com
DR process owned: Communicating status and process to current customer base.

Name: Zack Walker
Organization: Vision Database Systems
Contact details: Phone: (561) 386-1534
DR process owned: ...

Maintained in /wiki/spaces/PU/pages/2528968705


...

In the event of a disaster causing a major disruption to CloudCard’s production services, the person discovering the disaster must notify the Managing Director. The Managing Director will review the situation in consultation with the Principal Engineer (see Appendix: Diagnostic Steps) and determine the plan of action. If the disaster falls within the scope above, the Managing Director should activate this Disaster Recovery Plan and follow the checklist appropriate for the given scenario (see Appendix: Scenarios).

...

Person: Managing Director
Roles: Coordination and Communication
Responsibilities:
  • Determine activation of plan
  • Coordinate employee response
  • Communicate status internally and externally
  • Review and test plan annually
  • Ensure pizza is provided

Person: Principal Engineer
Roles: Technical Execution; Alternate for Managing Director
Responsibilities:
  • Ensure all failovers complete smoothly
  • Deploy new infrastructure to replace failed infrastructure where necessary
  • Review and test plan annually
  • Designate and brief an alternate person in case of unavailability

Person: Customer Support Team
Roles: Communication
Responsibilities:
  • Communicate status to customers
  • Handle questions from customers

Person: Engineering Team
Roles: Technical Execution
Responsibilities:
  • Support Principal Engineer as needed to recover services

Revision History

  • Version 1.1 (October 2018): Initial Plan

  • Version 1.1 (November 2019): Clarity and Accuracy Updates

  • Version 1.2 (March 2021): Updates to Contact Details

  • Version 1.2 (February 2023): Accuracy Update

  • Version 1.3 (March 2023): Updated to reflect Active-Active AWS strategy

  • Version 1.4 (March 2023): Improved based on results of testing of the plan

Appendices

Appendix: Disaster Recovery Strategies

...

  • RDS - the core application database is provisioned on two servers: a primary server and a read replica, located in separate Availability Zones. In the event of a failure of the primary server, the database fails over to the read replica, which is promoted to become the primary server. The former primary server can be rebooted and recovered to become the read replica, or, if it is completely out of commission, a new read replica can be instantiated. (A sketch for verifying this configuration follows this list.)

    • In addition to the active data, Aurora backs up the database automatically and continuously for a 7-day period. Additionally, Aurora takes a daily snapshot to provide further redundancy on top of the continuous backups.

  • Elastic Beanstalk - the application servers run in an auto scaling group distributed across more than one Availability Zone. If an entire Availability Zone were to become unavailable, the auto scaling logic would provision additional servers in the remaining Availability Zone until user load was met.

  • S3 (Simple Storage Service) - redundantly stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region, and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone. (from https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html)
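
The Aurora configuration described in the list above can be spot-checked directly against AWS. The following is a minimal sketch, not part of the recovery procedure itself; it assumes boto3 credentials are configured, and the region and cluster identifier are hypothetical placeholders, not CloudCard's actual names.

  # Sketch: confirm the Aurora topology and backup settings described above.
  # The region and cluster identifier are assumptions, not real CloudCard names.
  import boto3

  rds = boto3.client("rds", region_name="us-east-1")  # assumed region

  cluster = rds.describe_db_clusters(
      DBClusterIdentifier="cloudcard-aurora-cluster"  # hypothetical identifier
  )["DBClusters"][0]

  print("Availability Zones:", cluster["AvailabilityZones"])
  print("Continuous backup retention (days):", cluster["BackupRetentionPeriod"])
  for member in cluster["DBClusterMembers"]:
      role = "writer (primary)" if member["IsClusterWriter"] else "reader (replica)"
      print(member["DBInstanceIdentifier"], "-", role)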

Appendix: Diagnostic Steps

  1. Check the CloudCard Internal Status Dashboard (an internal site, known to the relevant employees, that shows CloudCard system health and the relevant AWS status feeds)

    1. Determine if CloudCard systems are experiencing downtime

    2. Determine if AWS has published any notices.

  2. Attempt to log into CloudCard

  3. Attempt to log into the AWS console

  4. Observe the state of the database and application environments (a sketch of these checks follows this list):

    1. Are the major components (auto scaling functionality, RDS cluster) still operational?

    2. Are auto scaling and failover functioning normally and recovering the services?

    3. Is service recovery trending toward normal within 15 minutes?
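
As a companion to step 4, these checks can also be scripted. The sketch below is illustrative only: it assumes boto3 credentials, enhanced health reporting on the Elastic Beanstalk environment, and hypothetical resource names and region.

  # Sketch of the step 4 checks: database cluster, application environment, auto scaling.
  # All resource names and the region are assumptions.
  import boto3

  REGION = "us-east-1"  # assumed region

  rds = boto3.client("rds", region_name=REGION)
  cluster = rds.describe_db_clusters(
      DBClusterIdentifier="cloudcard-aurora-cluster"  # hypothetical
  )["DBClusters"][0]
  print("RDS cluster status:", cluster["Status"])

  eb = boto3.client("elasticbeanstalk", region_name=REGION)
  health = eb.describe_environment_health(  # requires enhanced health reporting
      EnvironmentName="cloudcard-production",  # hypothetical
      AttributeNames=["HealthStatus", "Status", "Causes"],
  )
  print("Environment health:", health["HealthStatus"], health.get("Causes", []))

  autoscaling = boto3.client("autoscaling", region_name=REGION)
  for group in autoscaling.describe_auto_scaling_groups()["AutoScalingGroups"]:
      healthy = [i for i in group["Instances"] if i["HealthStatus"] == "Healthy"]
      print(group["AutoScalingGroupName"], f"{len(healthy)}/{group['DesiredCapacity']} healthy")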

Based on the evidence gathered from the diagnostic steps above, the Managing Director will decide, in consultation with the Principal Engineer, whether a disaster has occurred. If the disaster corresponds to one of the scenarios in the appendix below, the Managing Director will direct the execution of the corresponding checklist. If the disaster does not correspond to a prepared scenario, the Managing Director will consult with the Principal Engineer to determine the appropriate plan of action.

If AWS has not acknowledged the disaster on their public site, consider submitting an AWS support ticket to notify AWS of the issue.
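
If a ticket is warranted, it can be opened through the AWS Support API as well as the console. The sketch below assumes a Business or Enterprise support plan (the Support API is not available on basic support); the subject, body, and service/category codes are illustrative placeholders.

  # Sketch: open an AWS support case programmatically. Requires a Business or
  # Enterprise support plan; the codes and text below are placeholders.
  import boto3

  support = boto3.client("support", region_name="us-east-1")  # Support API endpoint is in us-east-1
  case = support.create_case(
      subject="Suspected Availability Zone disruption affecting our workloads",
      serviceCode="general-info",   # assumption; list valid codes with describe_services()
      categoryCode="using-aws",     # assumption
      severityCode="urgent",
      communicationBody=(
          "We are observing failures consistent with an AZ-level disruption. "
          "Please confirm whether AWS is aware of an ongoing event."
      ),
  )
  print("Support case opened:", case["caseId"])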

Appendix: Scenarios

Single AZ failure

Plan of Action

  1. Identify the issue, coordinate the initial response, and assemble the team in the appropriate meeting room (Managing Director)

  2. Contact Amazon to determine the extent of the unavailability (Managing Director)

  3. Evaluate unavailability timeframe and impact (Managing Director)

  4. Pray

  5. Monitor service failover (Principal Engineer); see the sketch after this checklist

    1. Ensure database failover occurs; Add additional read replica if needed.

    2. Ensure auto scaling replaces lost services with new nodes.

    3. Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.

    4. Determine if any secondary services are down, and recover them.

  6. Determine service and data recovery timeframes (Principal Engineer)

  7. If service is likely to be degraded for more than 15 minutes, direct the Communications Team to contact Customers and Resellers to make them aware of the situation and share timeframes (Managing Director)

    1. Update the service updates page and direct customers to review it for updates: https://onlinephotosubmission.com/service-updates

  8. Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
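
As referenced in step 5, the following is a minimal sketch of how the failover checks and the optional extra read replica might look with boto3. The identifiers, engine, and instance class are assumptions, not CloudCard's actual configuration.

  # Sketch of step 5: add an extra Aurora reader if needed and confirm auto scaling
  # is replacing lost application nodes. Names, engine, and sizing are assumptions.
  import boto3

  REGION = "us-east-1"  # assumed region
  rds = boto3.client("rds", region_name=REGION)

  # 5.1 Add an additional reader to the surviving Aurora cluster if capacity is short.
  rds.create_db_instance(
      DBInstanceIdentifier="cloudcard-aurora-reader-2",  # hypothetical
      DBClusterIdentifier="cloudcard-aurora-cluster",    # hypothetical
      Engine="aurora-postgresql",                        # assumed engine
      DBInstanceClass="db.r6g.large",                    # assumed sizing
  )

  # 5.2 Confirm the auto scaling group is replacing lost nodes.
  autoscaling = boto3.client("autoscaling", region_name=REGION)
  group = autoscaling.describe_auto_scaling_groups(
      AutoScalingGroupNames=["cloudcard-production-asg"]  # hypothetical
  )["AutoScalingGroups"][0]
  in_service = [i for i in group["Instances"] if i["LifecycleState"] == "InService"]
  print(f"{len(in_service)} of {group['DesiredCapacity']} instances in service")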

Two AZ failure

Plan of Action

  1. Identify the issue, coordinate the initial response, and assemble the team in the appropriate meeting room (Managing Director)

  2. Contact Amazon to determine the extent of the unavailability (Managing Director)

  3. Evaluate unavailability timeframe and impact (Managing Director)

  4. Deploy new infrastructure and restore data from backup (Principal Engineer)

    1. If database failover was successful, add additional read replica if needed.

      1. If the database completely failed, create a new cluster and restore from backup (a sketch follows this checklist).

    2. If auto scaling infrastructure is still in place, ensure auto scaling replaces lost services with new nodes.

      1. If the application scaling infrastructure is disabled, create a new application environment from backed up code artifacts.

    3. Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.

    4. Determine if any secondary services are down, and recover them.

  5. Determine service and data recovery timeframes (Principal Engineer)

  6. Direct the Communications Team to contact Customers and Resellers to make them aware of the situation (Managing Director)

    1. Update the service updates page and direct customers to review it for updates: https://onlinephotosubmission.com/service-updates

  7. Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
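
For step 4.1.1, a restore of a completely failed cluster could look like the sketch below. It assumes Aurora's continuous backups for the failed cluster are still usable; if they are not, the daily snapshot can be restored with restore_db_cluster_from_snapshot instead. All identifiers, the engine, and the instance class are hypothetical.

  # Sketch of step 4.1.1: restore a new Aurora cluster from continuous backups,
  # then add a writer instance (a restored cluster starts with no instances).
  # Identifiers, engine, and sizing are assumptions.
  import boto3

  rds = boto3.client("rds", region_name="us-east-1")  # assumed region

  rds.restore_db_cluster_to_point_in_time(
      SourceDBClusterIdentifier="cloudcard-aurora-cluster",     # failed cluster (hypothetical)
      DBClusterIdentifier="cloudcard-aurora-cluster-restored",  # new cluster (hypothetical)
      RestoreType="full-copy",
      UseLatestRestorableTime=True,
  )

  rds.create_db_instance(
      DBInstanceIdentifier="cloudcard-aurora-restored-writer",
      DBClusterIdentifier="cloudcard-aurora-cluster-restored",
      Engine="aurora-postgresql",      # assumed engine
      DBInstanceClass="db.r6g.large",  # assumed sizing
  )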

...

  • Use cross-region replication of data to sustain disasters affecting an entire region, while ensuring data residency compliance. For our Canadian customers, this will be implemented as the AWS Canada West (Calgary) Region becomes available (a sketch follows this list).

  • Update the plan to address human error disasters.
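
As one illustration of the first improvement, S3 data could be replicated across Canadian regions with a bucket replication rule. This is a sketch only, assuming versioned buckets and an existing IAM replication role; the bucket names, account ID, and role are hypothetical.

  # Sketch: S3 cross-region replication to a second Canadian region for data residency.
  # Bucket names, account ID, and IAM role are placeholders.
  import boto3

  s3 = boto3.client("s3")
  s3.put_bucket_replication(
      Bucket="cloudcard-photos-ca-central-1",  # hypothetical source bucket (versioning required)
      ReplicationConfiguration={
          "Role": "arn:aws:iam::123456789012:role/cloudcard-s3-replication",  # hypothetical
          "Rules": [
              {
                  "ID": "replicate-to-canada-west",
                  "Status": "Enabled",
                  "Priority": 1,
                  "Filter": {},
                  "DeleteMarkerReplication": {"Status": "Disabled"},
                  "Destination": {
                      # Target bucket would live in Canada West (Calgary) once adopted.
                      "Bucket": "arn:aws:s3:::cloudcard-photos-ca-west-1"  # hypothetical
                  },
              }
          ],
      },
  )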