Version 1.3

Quick Reference

If you believe a disaster has occurred that will affect our business, contact the Managing Director or the Principal Engineer.

Key contacts

CloudCard Support

Phone: (434) 253-5657
Email: support@onlinephotosubmission.com

Luke Rettstatt / Managing Director

Phone: (434) 253-5657
Email: luke@onlinephotosubmission.com

Anthony Erskine / Principal Engineer

Phone: (434) 248-0444
Email: tony@onlinephotosubmission.com

Key Locations

Primary Office:

1103 Wise Street

Lynchburg, VA 24504

Online Meeting Room:

https://onlinephotosubmission.com/meeting

External contacts

Name	Organization	Contact details	DR process owned
Todd Brooks	Color ID	Phone: (704) 897-1959; Email: Todd.Brooks@ColorID.com	Communicating status and process to current customer base.
Zack Walker	Vision Database Systems	Phone: (561) 386-1534; Email: zack.walker@visiondatabase.com	

Purpose

This document defines how CloudCard will respond to a disaster affecting our ability to serve our customers. The goal of the Disaster Recovery Plan is to restore services to the widest extent possible in the shortest possible time, while ensuring security and compliance are maintained.

Scope

For the purposes of this plan, a disaster is any event that causes prolonged unavailability of one or two AWS Availability Zones in CloudCard’s primary operating region.

The following events are excluded from the scope for this plan:

Loss of availability of the entire AWS region (large scale events of this sort will be responded to on a case-by-case basis).
Loss of availability of CloudCard’s offices (see Business Continuity Plan)
Loss of availability of a production application or service necessary to CloudCard’s operations that either (a) does not affect all of CloudCard’s core services, or (b) is short-lived (outage lasting less than 4 hours) (see Incident Response Plan)
Security breaches (see Incident Response Plan)
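
The scope rules above can be read as a simple decision procedure. The following Python sketch is illustrative only and is not part of the plan. The `Event` fields, the order in which the checks are applied, and the reading of "prolonged" as "at least 4 hours" (borrowed from the short-lived threshold in the exclusion list) are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Plan(Enum):
    DISASTER_RECOVERY = "Disaster Recovery Plan"
    INCIDENT_RESPONSE = "Incident Response Plan"
    BUSINESS_CONTINUITY = "Business Continuity Plan"
    CASE_BY_CASE = "Case-by-case response"


# Assumption: "prolonged" is treated as the inverse of the 4-hour
# "short-lived" threshold; the plan does not define it explicitly.
SHORT_LIVED_HOURS = 4.0


@dataclass
class Event:
    """Hypothetical description of an event; field names are illustrative."""
    azs_unavailable: int = 0            # Availability Zones down in the primary region
    entire_region_down: bool = False
    offices_unavailable: bool = False
    security_breach: bool = False
    expected_outage_hours: float = 0.0


def applicable_plan(event: Event) -> Plan:
    """Map an event to the plan the Scope section points to.

    The precedence of the checks below is an assumption.
    """
    if event.security_breach:
        return Plan.INCIDENT_RESPONSE
    if event.entire_region_down:
        return Plan.CASE_BY_CASE
    if event.offices_unavailable:
        return Plan.BUSINESS_CONTINUITY
    prolonged = event.expected_outage_hours >= SHORT_LIVED_HOURS
    if event.azs_unavailable in (1, 2) and prolonged:
        return Plan.DISASTER_RECOVERY
    # Short-lived or partial service outages are handled as incidents.
    return Plan.INCIDENT_RESPONSE
```

For example, `applicable_plan(Event(azs_unavailable=1, expected_outage_hours=12))` selects the Disaster Recovery Plan, while the same event with a one-hour outage falls to the Incident Response Plan. The final decision always rests with the Managing Director, as described under Policy.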

Policy

In the event of a disaster causing a major disruption to CloudCard’s production services, the person discovering the disaster must notify the Managing Director. The Managing Director will review the situation in consultation with the Principal Engineer and determine next steps. If the disaster falls within the scope above, the Managing Director should activate this Disaster Recovery Plan and follow the checklist appropriate for the given scenario (see Appendix: Scenarios).

Hard copies of this plan should be kept in each CloudCard office, as well as in the home office of each employee.

Review

This plan must be reviewed and tested annually and updated to address any issues identified. The plan must also be reviewed and updated after any activation of the plan to determine improvements for future disaster scenarios.

Activation

This Disaster Recovery plan is to be activated when one or more of the following criteria are met:

An Amazon data center in which CloudCard stores its data is unavailable or is in imminent danger of becoming unavailable for an extended period of time.

The person discovering the potential disaster must notify the Managing Director (contact details listed above). If the Managing Director is unavailable, the Principal Engineer must be notified instead.

Communications Processes

Once notified of a potential disaster, if the Managing Director activates this Plan, the Managing Director will direct the Principal Engineer and all relevant employees to convene in the CloudCard Meeting Room (https://onlinephotosubmission.com/meeting). This online meeting room will be used as the primary mechanism to coordinate action and internally communicate status updates.

If the CloudCard Meeting Room is unavailable, the Managing Director will arrange an alternate digital or physical meeting room and communicate the location to the Principal Engineer and all relevant employees.

Roles and Responsibilities

Person	Roles	Responsibilities
Managing Director	Coordination and Communication	Determine activation of plan; coordinate employee response; communicate status internally and externally; review and test plan annually
Principal Engineer	Technical Execution; Alternate for Managing Director	Ensure all failovers complete smoothly; deploy new infrastructure to replace failed infrastructure where necessary; review and test plan annually

Revision History

Document Revision History is maintained in Confluence. Changes to this document should be made with comments indicating the reason for the change. To do this, open the menu in Confluence and click the “Publish with Version Comment” option.

Version	Date	Comment
Current Version (v. 2)	Mar 10, 2023 15:12	Ryan Heathcote
v. 8	Mar 30, 2023 14:45	Ryan Heathcote
v. 7	Mar 29, 2023 20:59	Ryan Heathcote
v. 6	Mar 29, 2023 20:58	Ryan Heathcote
v. 5	Mar 29, 2023 20:41	Ryan Heathcote
v. 4	Mar 10, 2023 16:55	Ryan Heathcote
v. 3	Mar 10, 2023 15:30	Ryan Heathcote
v. 2	Mar 10, 2023 15:12	Ryan Heathcote
v. 1	Feb 24, 2023 14:04	Ryan Heathcote

This document was imported to Confluence from Google Docs on 2/24/2023. Below is the history of the document from Google Docs:

Version 1.1 - October 2018 - initial Google Docs version

Version 1.1 - November 2019 - clarity and accuracy updates

Version 1.2 - March 2021 - updates to contact details.

Version 1.2 - February 8, 2023 - accuracy update.

Version 1.2 - February 24, 2023 - migrated to Confluence as v. 1 (see above).

Appendices

Appendix: Disaster Recovery Strategies

AWS Multi-site Active-Active Strategy

Application load is distributed across multiple resources located in two or more physical locations (AWS Availability Zones). If one Availability Zone becomes unavailable, resources are automatically or manually provisioned in the healthy Availability Zone to handle the load from the first zone.

Specific resources following these strategies:

RDS - core application database is provisioned on two servers: a primary server and a read replica, located in separate Availability Zones. In the event of a failure of the primary server, the database fails over to the read replica, which is promoted to become the primary server. The former primary server can be rebooted and recovered to become the read replica, or if it is completely out of commission, a new read replica can be instantiated.
- In addition to active data, Aurora backs up the database automatically and continuously for a 7-day period. Aurora also takes a daily snapshot to provide further redundancy beyond the continuous backups.
Elastic Beanstalk - the application servers run in an auto scaling group distributed across more than one availability zone. If an entire availability zone were to become unavailable, the auto scaling logic would provision more servers in the other availability zone until the load from the users was met.
S3 (Simple Storage Service) - redundantly stores objects on multiple devices across a minimum of three Availability Zones in an AWS Region, and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone. (https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html)
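
As a rough illustration of the database failover described above, the sketch below shows one way the Principal Engineer might confirm which instance is currently the writer and whether a reader remains. It assumes the core database is an Aurora cluster and that `boto3` and AWS credentials are available. The function names and the cluster identifier passed to `fetch_cluster_roles` are hypothetical and are not taken from CloudCard's configuration.

```python
from typing import Any, Dict, List


def summarize_cluster_roles(cluster: Dict[str, Any]) -> Dict[str, List[str]]:
    """Split the members of one DBClusters entry into writer and readers."""
    roles: Dict[str, List[str]] = {"writer": [], "readers": []}
    for member in cluster.get("DBClusterMembers", []):
        key = "writer" if member.get("IsClusterWriter") else "readers"
        roles[key].append(member["DBInstanceIdentifier"])
    return roles


def needs_new_replica(roles: Dict[str, List[str]]) -> bool:
    """True when a writer exists but no reader is left to fail over to."""
    return len(roles["writer"]) == 1 and not roles["readers"]


def fetch_cluster_roles(cluster_id: str) -> Dict[str, List[str]]:
    """Query RDS for the cluster (hypothetical identifier) and summarize it."""
    import boto3  # imported lazily so the helpers above have no dependencies

    rds = boto3.client("rds")
    response = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
    return summarize_cluster_roles(response["DBClusters"][0])
```

If `needs_new_replica` returns `True` after a failover, the cluster is running on a single instance and a replacement read replica should be created, as the RDS item above describes.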

Appendix: Scenarios

Single AZ failure

Plan of Action

Identify issue, coordinate initial response (Managing Director)
Contact Amazon to determine the extent of the unavailability (Managing Director)
Evaluate unavailability timeframe and impact (Managing Director)
Monitor service failover (Principal Engineer)
1. Ensure database failover occurs; add a read replica if needed.
2. Ensure auto scaling replaces lost services with new nodes.
3. Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.
4. Determine if any secondary services are down, and recover them.
Contact External Organizations to make them aware of the situation (Managing Director)
Determine service and data recovery timeframes (Principal Engineer)
Share timeframes with customers and external organizations (Managing Director)
Improve upon the process in case of a future disaster (Managing Director and Principal Engineer)
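
The auto scaling check in the plan of action (confirming that lost application servers have been replaced) could be scripted along the following lines. This is a minimal sketch and not CloudCard's actual tooling: it assumes `boto3` and AWS credentials are available, and the function names and the group name passed to `fetch_group` are hypothetical.

```python
from collections import Counter
from typing import Any, Dict


def in_service_by_zone(group: Dict[str, Any]) -> Counter:
    """Count healthy, in-service instances per Availability Zone."""
    counts: Counter = Counter()
    for instance in group.get("Instances", []):
        in_service = instance.get("LifecycleState") == "InService"
        healthy = instance.get("HealthStatus") == "Healthy"
        if in_service and healthy:
            counts[instance["AvailabilityZone"]] += 1
    return counts


def capacity_restored(group: Dict[str, Any]) -> bool:
    """True once healthy in-service instances meet the desired capacity."""
    return sum(in_service_by_zone(group).values()) >= group["DesiredCapacity"]


def fetch_group(group_name: str) -> Dict[str, Any]:
    """Fetch one Auto Scaling group description (hypothetical group name)."""
    import boto3  # imported lazily so the helpers above have no dependencies

    client = boto3.client("autoscaling")
    response = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    return response["AutoScalingGroups"][0]
```

Calling `capacity_restored(fetch_group(name))` periodically during the failover gives a quick signal of whether replacement nodes have come up in the healthy Availability Zone. `in_service_by_zone` shows where they are running.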

Two AZ failure