Disaster Recovery Plan

Version: 1.10

Owner: Head of Engineering

Last Updated on: Jan 31, 2024

Last Updated by: @Bruno Belizario

Approved by: @Raphael Santos

Information Technology Statement of Intent

This document delineates our policies and procedures for technology disaster recovery, as well as our process-level plans for recovering critical technology platforms and the telecommunications infrastructure. This document summarizes our recommended procedures. In the event of an actual emergency situation, modifications to this document may be made to ensure physical safety of our people, our systems, and our data. 

Our mission is to ensure information system uptime, data integrity and availability, and business continuity.

For any questions relating to this document or our Security and Privacy practices, please contact us at issues@ecoportal.co.nz.

Policy Statement

Management has approved the following policy statement:

  • The company shall have an IT disaster recovery plan;

  • The disaster recovery plan should cover all essential and critical infrastructure, in accordance with key business activities;

  • All staff must be made aware of the disaster recovery plan and their own respective roles;

  • The disaster recovery plan is to be kept up to date to take into account changing circumstances.

Objectives

This plan addresses:

  • The need to ensure that all employees fully understand their duties in implementing the plan;

  • The need to ensure that operational policies are adhered to within all planned activities;

  • Disaster recovery capabilities as applicable to key customers, vendors and others.

Key Personnel Contact Info

Name | Position | Phone | Email | Location

Raphael Santos | Chief Technology Officer | +64 21 207 4880 | raphael@ecoportal.com | Auckland, New Zealand

Bruno Belizario | DevOps Engineer | +55 22 99967 9572 | bruno@ecoportal.com | Brazil

Julian Santos | DevOps Engineer | +55 24 98818 7572 | julian@ecoportal.com | Brazil

Sean Oldfield | Front End Lead Engineer | +64 21 947 878 | sean@ecoportal.com | Auckland, New Zealand

1 Plan Overview

1.1 Plan Updating

It is necessary for the DRP updating process to be properly structured and controlled.  

Whenever changes are made to the plan they are to be fully tested and appropriate amendments should be made to the training materials. 

1.2 Plan Documentation Storage

Physical copies of this plan are available at the ecoPortal office at Mockba House, Level 2, 24 St Benedicts Street, Eden Terrace.


2 Emergency Response

2.1 Alert, escalation and plan invocation

2.1.1 Plan Triggering Events

Key trigger issues at headquarters that would lead to activation of the DRP are: 

  • A core component of our cloud service offering has failed / stopped responding.

  • An information security incident has occurred and led to a significant loss of access, mass alteration or removal of production data.

2.1.2 Activation of Emergency Response Team

When an incident occurs the Emergency Response Team (ERT) must be activated.  The ERT will then decide the extent to which the DRP must be invoked.  Responsibilities of the ERT are to: 

  • Assess the extent of the disaster and its impact on the business, data center, etc.;

  • Decide which elements of the DR Plan should be activated;

  • Establish and manage the disaster recovery team to maintain vital services and return to normal operation;

  • Ensure employees are notified and allocate responsibilities and activities as required.

2.2 Disaster Recovery Team

The team will be contacted and assembled by the ERT.  The team's responsibilities include:

  • Establish facilities for an emergency level of service;

  • Restore key services;

  • Recover to business as usual;

  • Coordinate activities with first responders, vendors and other external parties;

  • Report to the emergency response team.

2.3 Emergency Alert, Escalation and DRP Activation

This policy and procedure has been established to ensure that in the event of a disaster or crisis, personnel will have a clear understanding of who should be contacted.  Procedures have been addressed to ensure that communications can be quickly established while activating disaster recovery.  

The DR plan will rely principally on key members of management and staff who will provide the technical and management skills necessary to achieve a smooth technology and business recovery.  

2.3.1 Emergency Alert

The person discovering the incident calls a member of the Emergency Response Team in the order listed:

Emergency Response Team:

If not available try:

The Emergency Response Team (ERT) is responsible for activating the DRP for disasters identified in this plan, as well as in the event of any other occurrence that affects the company’s capability to perform normally.  

One of the tasks during the early stages of the emergency is to notify the Disaster Recovery Team (DRT) that an emergency has occurred. The Business Recovery Team (BRT) will consist of senior representatives from the main business departments. The BRT Leader will be a senior member of the company's management team and will be responsible for taking overall charge of the process and ensuring that the company returns to normal working operations as early as possible. 

2.3.2 Contact with Authorities and Notification of Customers

In the case of an Information Security Incident, members of the management team should notify the authorities relevant to the nature of the incident, including the police, Fire and Emergency New Zealand, and CERT NZ.

Should the incident affect customer data, all impacted customers should be notified and kept informed of the severity of the incident, its scope and the progress being made to restore service. This information is only to be held back if the nature of the information would increase the impact of the event before remedial steps can be undertaken.

2.3.3 DR Procedures for Management

Members of the management team will keep a hard copy of the names and contact numbers of each employee in their departments.  In addition, management team members will have a hard copy of the company’s disaster recovery and business continuity plans on file in their homes in the event that the headquarters building is inaccessible, unusable, or destroyed. 

2.3.4 Contact with Employees

Managers will serve as the focal points for their departments, while designated employees will call other employees to discuss the crisis/disaster and the company’s immediate plans.  Employees who cannot reach staff on their call list are advised to call the staff member’s emergency contact to relay information on the disaster.

2.3.5 Backup Staff

If a manager or staff member designated to contact other staff members is unavailable or incapacitated, the designated backup staff member will perform notification duties. 

3 Recovery Objectives

3.1 Recovery Time Objective (RTO)

In our commitment to maintaining optimal business operations, we have established a Recovery Time Objective (RTO) of 4 hours, representing the maximum allowable downtime in the event of a disruption. This RTO is a critical component of our business continuity plan, outlining the targeted duration within which our processes and systems must be restored to avoid any adverse impact on our operations.

A significant aspect of our recovery strategy involves acknowledging our dependency on third-party providers. These external entities play a pivotal role in the overall resilience of our systems. As part of our risk mitigation efforts, we have strategically chosen leading vendors renowned for their reliability and robust infrastructure: Amazon Web Services (AWS) for cloud services and MongoDB Atlas for database management.

3.2 Recovery Point Objective (RPO)

The Recovery Point Objective (RPO) signifies the acceptable amount of data loss that ecoPortal can tolerate in the event of a disruption. With an RPO of 2 hours, we are committed to preserving data integrity by ensuring that, at most, only the last 2 hours of data would be lost in the event of an incident. This aligns with best practices for minimizing data loss and maintaining business continuity.

Implementing point-in-time recovery provides us the ability to restore data to any defined moment within the last 7 days. This flexibility is a strategic decision that recognizes the varying nature and timing of potential data incidents. By allowing recovery to specific points in time, we gain precision in restoring data, mitigating the impact of errors, data corruption, or unintended changes.
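As a concrete illustration of these objectives, the sketch below (a hypothetical helper, not part of our tooling) checks whether a candidate restore point satisfies the 2-hour RPO and still falls within the 7-day point-in-time recovery window:

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=2)          # maximum tolerable data loss
PITR_WINDOW = timedelta(days=7)   # point-in-time recovery window

def restore_point_ok(incident: datetime, restore_point: datetime,
                     now: datetime) -> bool:
    """Return True if restoring to `restore_point` loses at most RPO
    worth of data and the point is still within the PITR window."""
    data_loss = incident - restore_point
    within_window = now - restore_point <= PITR_WINDOW
    return timedelta(0) <= data_loss <= RPO and within_window

now = datetime(2024, 1, 31, 12, 0)
incident = datetime(2024, 1, 31, 11, 30)
print(restore_point_ok(incident, datetime(2024, 1, 31, 10, 0), now))  # True: 1.5 h of loss
print(restore_point_ok(incident, datetime(2024, 1, 31, 8, 0), now))   # False: 3.5 h exceeds the RPO
```

Any restore point from the last 7 days can be chosen, but only points at most 2 hours before the incident keep data loss within the stated objective.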


Appendix A – Technology Disaster Recovery Plan of Core ecoPortal Systems

Type: Processes

Purpose: ecoPortal web, workers

Provider: AWS ECS

 

Overview:

ecoPortal uses Amazon's Elastic Container Service (ECS) to host its processes in the AWS Sydney region.

 

Recovery Strategy 1:

Containers in the cluster are distributed across at least two data centers (availability zones). Web and worker processes in the cluster balance themselves across the data centers. Containers are automatically monitored, and faulty processes are automatically replaced. In the event of a data center outage, ecoPortal will continue to function out of the other data center, initially at half capacity, and will automatically redeploy the remaining capacity to the functioning data center.
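The capacity arithmetic behind this strategy can be sketched as follows; the function is purely illustrative (ECS performs this rebalancing automatically), and the zone names and task counts are assumptions:

```python
def tasks_to_redeploy(desired_total: int, tasks_per_az: dict, failed_az: str) -> int:
    """Replacement tasks that must be started in the surviving
    availability zones to restore the desired total after one zone
    fails. Illustrative only: ECS does this rebalancing automatically."""
    surviving = sum(n for az, n in tasks_per_az.items() if az != failed_az)
    return desired_total - surviving

# 8 tasks balanced across two Sydney availability zones; zone "ap-southeast-2a" fails:
print(tasks_to_redeploy(8, {"ap-southeast-2a": 4, "ap-southeast-2b": 4}, "ap-southeast-2a"))
# -> 4: the service runs at half capacity until 4 tasks are redeployed to 2b
```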

 

Recovery Strategy 2:

In the event of severe service degradation or prolonged downtime across the Sydney data centers, we could look at transferring ecoPortal to other data centers in Australia. ecoPortal runs on Docker, which makes it fairly platform-agnostic. This scenario is extremely unlikely, however, as AWS is the gold standard in data centers: a downtime event of this level would affect a large chunk of the internet in the APAC region, and a move like this would take at least a week of work.

 

Recovery Strategy 3:

A faster option to recover from a total failure in the Sydney region would be to deploy ecoPortal to an AWS data center in another Australian region.

 

Tests: Strategy 1 is tested every 6 months.

Type: Database

Purpose: All client data

Provider: MongoDB Atlas

 

MongoDB Atlas disaster recovery plan.

 

Overview:

MongoDB database with all client data.

 

Backup:

Client data is actively distributed across multiple availability zones (data centres). Backups are redundant across two availability zones and continuous (within a couple of seconds).

 

Additionally, we take a dump backup every day and store it in an S3 bucket, keeping the last 14 backups there.
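A retention policy like this can be enforced with a small pruning step. The sketch below is hypothetical: the key names and the helper are assumptions, and actual deletion would run against S3.

```python
def backups_to_delete(keys, keep=14):
    """Return the object keys that fall outside the newest `keep`
    daily dumps. Assumes keys embed an ISO date (e.g. 'dump-2024-01-31'),
    so lexical order matches chronological order."""
    newest_first = sorted(keys, reverse=True)
    return newest_first[keep:]

keys = [f"dump-2024-01-{day:02d}" for day in range(1, 21)]  # 20 daily dumps
print(backups_to_delete(keys))
# -> the six oldest dumps, 'dump-2024-01-06' down to 'dump-2024-01-01'
```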

 

Recovery Strategy 1:

In the event of a zone going down, the database will fail over seamlessly to an active secondary with no service interruption.

 

Recovery Strategy 2:

In the event of data loss, we rely on point-in-time backup and restore. Additionally, we frequently test the restoration process in our staging environments.

Tests: Both strategies are tested every 6 months.

Type: Database

Purpose: Notification read/unread/email status

Provider: Amazon ElastiCache

 

Amazon ElastiCache disaster recovery plan.

 

Overview:

Data stored in ElastiCache is mostly basic metadata around email queuing and the notification tray, and as such is non-critical. The system could operate without it, experiencing only a temporary loss of notifications.

 

Backup:

This data is actively distributed across two servers and backed up to disk in case a server crashes. Backups are taken daily, stored on S3 and replicated across two continents.

 

Recovery strategy:

Should one data center or server fail, the service will fail over seamlessly to the other with no loss of data.

Type: Search

Purpose: Search index

Provider: Amazon OpenSearch Service

 

Amazon OpenSearch Service disaster recovery plan.

 

Overview:

Some client data is reprocessed and optimized for search, filtering and analytics. All data stored in OpenSearch is derivative of data stored in MongoDB. Our cluster runs multiple active OpenSearch instances in multiple data centers for failover and redundancy.

 

Recovery strategy 1:

Spin up a new OpenSearch cluster and trigger the reindexing process using the data stored in MongoDB.
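The reindexing step essentially streams documents out of MongoDB and into OpenSearch's `_bulk` API. The sketch below shows only the payload construction; the index name and document fields are assumptions, and the real process would page through collections and POST each batch to the cluster:

```python
import json

def to_bulk_payload(docs, index_name):
    """Render MongoDB-style documents as an OpenSearch _bulk request
    body (newline-delimited JSON: an action line, then the document)."""
    lines = []
    for doc in docs:
        doc = dict(doc)                 # avoid mutating the caller's dict
        doc_id = str(doc.pop("_id"))    # the Mongo _id becomes the OpenSearch _id
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"      # _bulk bodies must end with a newline

docs = [{"_id": 1, "title": "Incident report"}, {"_id": 2, "title": "Audit"}]
print(to_bulk_payload(docs, "ecoportal-search"))
```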

 

Recovery strategy 2:

When more catastrophic problems occur, it is possible to rehost with a variety of service providers (e.g. Elastic Cloud) or a self-hosted instance, with only temporary service degradation while ecoPortal reindexes content for search in the new location.

 

Tests: Strategy 1 is tested every 6 months.

Type: Email

Purpose: All transactional email, notifications

Provider: SendGrid

 

Sendgrid disaster recovery plan.

 

Overview:

SendGrid handles sending emails. It is widely used, processing over ten billion emails per month on behalf of a huge variety of organisations and individuals, big and small.

 

Recovery strategy 1:

In the highly unlikely event of a service disruption, we are able to swap the mail server to a competitor or to our own hosted SMTP server within minutes.
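Mechanically, such a swap amounts to changing which relay the application points at. A minimal sketch of the selection logic follows; every hostname except SendGrid's real SMTP endpoint is a placeholder:

```python
def pick_relay(relays, healthy):
    """Return the first relay in priority order that the health check
    reports as up. The relay list and health map are illustrative config."""
    for host in relays:
        if healthy.get(host):
            return host
    raise RuntimeError("no mail relay available")

relays = [
    "smtp.sendgrid.net",             # primary (SendGrid's SMTP endpoint)
    "smtp.backup-provider.example",  # placeholder for a competitor service
    "smtp.internal.example",         # placeholder for a self-hosted SMTP server
]
print(pick_relay(relays, {"smtp.sendgrid.net": False,
                          "smtp.backup-provider.example": True}))
# -> smtp.backup-provider.example
```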

Type: DNS

Purpose: Resolve domains

Provider: Cloudflare

 

Cloudflare disaster recovery plan.

 

Overview:

Cloudflare's Service Level Agreement commits to 100% availability during a monthly billing cycle.

 

Recovery strategy 1:

Cloudflare has a fully redundant, geo-distributed architecture, so in the case of a disaster there is automatic redundancy.

 

Recovery strategy 2:

In the highly unlikely event of cessation of service, there are numerous similarly accredited competitors we can shift to.

Type: SSL

Purpose: Security layer for HTTP transport

Provider: AWS Certificate Manager

 

AWS Certificate Manager disaster recovery plan.

 

Overview:

AWS is one of the largest providers of this service, so a certificate-related failure is very unlikely.

 

Recovery strategy 1:

It would be incredibly unlikely for AWS Certificate Manager to fail, but if needed, certificates can be arranged and issued rapidly from DigiCert or other providers.