...

Key contacts

CloudCard Support

  • Phone: (434) 253-5657

  • Email: support@onlinephotosubmission.com

Luke Rettstatt / Managing Director

Anthony Erskine / Principal Engineer

Key Locations

Primary Office:

1103 Wise Street, Lynchburg, VA 24504

Online Meeting Room:

https://onlinephotosubmission.com/meeting

External contacts

Todd Brooks

  • Organization: Color ID

  • Phone: (704) 897-1959

  • Email: Todd.Brooks@ColorID.com

  • DR process owned: Communicating status and process to current customer base.

Zack Walker

  • Organization: Vision Database Systems

  • Phone: (561) 386-1534

  • Email: zack.walker@visiondatabase.com

...

Hard copies of this plan should be kept in each CloudCard office, as well as the home office of all relevant employees.

Review

This plan must be reviewed and tested annually and updated to address any issues identified. The plan must also be reviewed and updated after any activation of the plan to determine improvements for future disaster scenarios.

...

Managing Director

  • Role: Coordination and Communication

  • Responsibilities:

    • Determine activation of plan

    • Coordinate employee response

    • Communicate status internally and externally

    • Review and test plan annually

    • Ensure pizza is provided

Principal Engineer

  • Role: Technical Execution; Alternate for Managing Director

  • Responsibilities:

    • Ensure all failovers complete smoothly

    • Deploy new infrastructure to replace failed infrastructure where necessary

    • Review and test plan annually

    • Designate and brief an alternate person in case of unavailability

Revision History

...

Version

...

Change History

This document was imported to Confluence from Google Docs on 2/24/2023. Below is the history of the document from Google Docs:

...

Date

Changes

1.1

...

October 2018

...

Initial Plan

1.1

...

November 2019

...

Clarity and Accuracy Updates

1.2

...

March 2021

...

Updates to Contact Details

1.2

...

February

...

2023

Accuracy Update

1.3

March 2023

Updated to reflect Active-Active AWS strategy

Appendices

Appendix: Disaster Recovery Strategies

...

  • RDS - core application database is provisioned on two servers: a primary server and a read replica, located in separate Availability Zones. In the event of a failure of the primary server, the database fails over to the read replica, which is promoted to become the primary server. The former primary server can be rebooted and recovered to become the read replica, or if it is completely out of commission, a new read replica can be instantiated.

    • In addition to active data, Aurora backs up the database automatically and continuously for a 7-day period. Aurora also takes a daily snapshot to provide further redundancy beyond the continuous backups.

  • Elastic Beanstalk - the application servers run in an auto scaling group distributed across more than one Availability Zone. If an entire Availability Zone were to become unavailable, the auto scaling logic would provision more servers in the remaining Availability Zone until user load was met.

  • S3 (Simple Storage Service) - redundantly stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region, and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone. (from https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html )
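The database redundancy described above can be spot-checked from the output of the AWS describe calls. The following is a minimal sketch, not CloudCard's actual tooling: it assumes the cluster and instance dictionaries have the shape returned by the RDS `describe_db_clusters` and `describe_db_instances` calls, and the identifiers and zone names in the sample data are hypothetical.

```python
def summarize_cluster(cluster, instances):
    """Summarize failover readiness of one Aurora cluster.

    cluster:   one entry shaped like describe_db_clusters()["DBClusters"][i]
    instances: entries shaped like describe_db_instances()["DBInstances"]
    """
    members = cluster["DBClusterMembers"]
    writers = [m["DBInstanceIdentifier"] for m in members if m["IsClusterWriter"]]
    readers = [m["DBInstanceIdentifier"] for m in members if not m["IsClusterWriter"]]

    # Map each instance to its Availability Zone, then collect the zones in use.
    zone_of = {i["DBInstanceIdentifier"]: i["AvailabilityZone"] for i in instances}
    zones = {zone_of[name] for name in writers + readers if name in zone_of}

    return {
        "has_writer": len(writers) == 1,
        "reader_count": len(readers),
        "separate_zones": len(zones) >= 2,
        # The plan states a 7-day continuous backup window.
        "retention_ok": cluster.get("BackupRetentionPeriod", 0) >= 7,
    }


# Hypothetical sample data for illustration only.
cluster = {
    "BackupRetentionPeriod": 7,
    "DBClusterMembers": [
        {"DBInstanceIdentifier": "db-a", "IsClusterWriter": True},
        {"DBInstanceIdentifier": "db-b", "IsClusterWriter": False},
    ],
}
instances = [
    {"DBInstanceIdentifier": "db-a", "AvailabilityZone": "us-east-1a"},
    {"DBInstanceIdentifier": "db-b", "AvailabilityZone": "us-east-1b"},
]
print(summarize_cluster(cluster, instances))
```

A `reader_count` of 0 after a failover would indicate that a new read replica should be instantiated.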

...

  1. Identify issue, coordinate initial response (Managing Director)

  2. Contact Amazon to determine extent of unavailability (Managing Director)

  3. Evaluate unavailability timeframe and impact (Managing Director)

  4. Monitor service failover (Principal Engineer)

    1. Ensure database failover occurs; add an additional read replica if needed.

    2. Ensure auto scaling replaces lost services with new nodes.

    3. Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.

    4. Determine if any secondary services are down, and recover them.

  5. Determine service and data recovery timeframes (Principal Engineer)

  6. Contact customers and External Organizations to make them aware of the situation and share timeframes (Managing Director)

  7. Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
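The auto scaling check in step 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the team's actual monitoring: it assumes a group dictionary shaped like one entry of the Auto Scaling `describe_auto_scaling_groups` response, and the instance IDs and zone names in the sample are hypothetical.

```python
def in_service_by_zone(group):
    """Count healthy, in-service instances per Availability Zone."""
    counts = {}
    for inst in group["Instances"]:
        if inst["LifecycleState"] == "InService" and inst["HealthStatus"] == "Healthy":
            zone = inst["AvailabilityZone"]
            counts[zone] = counts.get(zone, 0) + 1
    return counts


def capacity_met(group):
    """True when healthy in-service instances cover the desired capacity."""
    return sum(in_service_by_zone(group).values()) >= group["DesiredCapacity"]


# Hypothetical group during an outage of one zone: the node in us-east-1a
# is unhealthy and two replacement nodes are running in us-east-1b.
group = {
    "DesiredCapacity": 2,
    "Instances": [
        {"InstanceId": "i-1", "AvailabilityZone": "us-east-1a",
         "LifecycleState": "Terminating", "HealthStatus": "Unhealthy"},
        {"InstanceId": "i-2", "AvailabilityZone": "us-east-1b",
         "LifecycleState": "InService", "HealthStatus": "Healthy"},
        {"InstanceId": "i-3", "AvailabilityZone": "us-east-1b",
         "LifecycleState": "InService", "HealthStatus": "Healthy"},
    ],
}
print(in_service_by_zone(group), capacity_met(group))
```

If `capacity_met` stays false for an extended period, auto scaling is not replacing the lost nodes and manual intervention may be needed.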

...

  1. Identify issue, coordinate initial response (Managing Director)

  2. Contact Amazon to determine extent of unavailability (Managing Director)

  3. Evaluate unavailability timeframe and impact (Managing Director)

  4. Instruct Principal Engineer to restore services from backups (Managing Director)

  5. Deploy new infrastructure and restore data from backup (Principal Engineer)

    1. If database failover was successful, add an additional read replica if needed.

      1. If the database completely failed, create a new cluster and restore from backup.

    2. If auto scaling infrastructure is still in place, ensure auto scaling replaces lost services with new nodes.

      1. If the application scaling infrastructure is disabled, create a new application environment from backed up code artifacts.

    3. Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.

    4. Determine if any secondary services are down, and recover them.

  6. Determine service and data recovery timeframes (Principal Engineer)

  7. Contact customers and External Organizations to make them aware of the situation and share timeframes (Managing Director)

  8. Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
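The branching in step 5 above can be written out as a small decision function. This is only a sketch of the logic as described in the steps; the function name, its parameters, and the `min_readers` threshold are assumptions introduced for illustration and do not come from the plan.

```python
def plan_recovery(db_failover_ok, reader_count, autoscaling_in_place, min_readers=1):
    """Return the ordered recovery actions for the Principal Engineer.

    min_readers is a hypothetical threshold for "read replica needed".
    """
    actions = []

    # Database: top up replicas after a good failover, otherwise rebuild.
    if db_failover_ok:
        if reader_count < min_readers:
            actions.append("Add an additional read replica")
    else:
        actions.append("Create a new database cluster and restore from backup")

    # Application tier: rely on auto scaling if it survived, otherwise rebuild.
    if autoscaling_in_place:
        actions.append("Ensure auto scaling replaces lost services with new nodes")
    else:
        actions.append("Create a new application environment from backed up code artifacts")

    # These checks apply in every scenario.
    actions.append("Check for data loss or data needing correction; restore from backups if so")
    actions.append("Check secondary services and recover any that are down")
    return actions


# Worst case: the database failed completely and auto scaling is disabled.
for step in plan_recovery(db_failover_ok=False, reader_count=0, autoscaling_in_place=False):
    print(step)
```

Keeping the branches in one place makes it easier to confirm during the annual review that every scenario still ends with the data-loss and secondary-service checks.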

...