Disaster Recovery Plan
Version | 1.12
Owner | CTO
Last Updated on | Sep 2, 2024
Last Updated by | @Bruno Belizario
Approved by | @Sean Oldfield
Last Review | Sep 2, 2024
Information Technology Statement of Intent
This document delineates our policies and procedures for technology disaster recovery, as well as our process-level plans for recovering critical technology platforms and the telecommunications infrastructure. This document summarizes our recommended procedures. In an actual emergency, this document may be modified as needed to ensure the physical safety of our people, our systems, and our data.
Our mission is to ensure information system uptime, data integrity and availability, and business continuity.
For any questions relating to this document or our Security and Privacy, please contact us at issues@ecoportal.co.nz
Policy Statement
Management has approved the following policy statement:
The company shall have an IT disaster recovery plan;
The disaster recovery plan should cover all essential and critical infrastructure, in accordance with key business activities;
All staff must be made aware of the disaster recovery plan and their own respective roles;
The disaster recovery plan is to be kept up to date to take into account changing circumstances.
Objectives
The principal objectives of this plan are:
To ensure that all employees fully understand their duties in implementing the plan;
To ensure that operational policies are adhered to within all planned activities;
To provide disaster recovery capabilities as applicable to key customers, vendors and others.
Key Personnel Contact Info
Name | Position | Phone | Email | Location
---|---|---|---|---
Raphael Santos | Chief Technology Officer | +64 21 207 4880 | | Auckland, New Zealand
 | DevOps Engineer | +55 22 99967 9572 | | Brazil
Julian Santos | DevOps Engineer | +55 24 98818 7572 | julian@ecoportal.com | Brazil
Sean Oldfield | Front End Lead Engineer | +64 21 947 878 | sean@ecoportal.com | Auckland, New Zealand
1 Plan Overview
1.1 Plan Updating
It is necessary for the DRP updating process to be properly structured and controlled.
Whenever changes are made to the plan they are to be fully tested and appropriate amendments should be made to the training materials.
1.2 Plan Documentation Storage
Physical copies of this plan are available at the ecoPortal office at Mockba House, Level 2, 24 St Benedicts Street, Eden Terrace.
2 Emergency Response
2.1 Alert, escalation and plan invocation
2.1.1 Plan Triggering Events
Key trigger issues at headquarters that would lead to activation of the DRP are:
A core component of our cloud service offering has failed / stopped responding.
An information security incident has occurred and led to a significant loss of access, mass alteration or removal of production data.
2.1.2 Activation of Emergency Response Team
When an incident occurs the Emergency Response Team (ERT) must be activated. The ERT will then decide the extent to which the DRP must be invoked. Responsibilities of the ERT are to:
Assess the extent of the disaster and its impact on the business, data center, etc.;
Decide which elements of the DR Plan should be activated;
Establish and manage the disaster recovery team to maintain vital services and return to normal operation;
Ensure employees are notified and allocate responsibilities and activities as required.
2.2 Disaster Recovery Team
The team will be contacted and assembled by the ERT. The team's responsibilities include:
Establish facilities for an emergency level of service;
Restore key services;
Recover to business as usual;
Coordinate activities with first responders and other involved parties;
Report to the emergency response team.
2.3 Emergency Alert, Escalation and DRP Activation
This policy and procedure has been established to ensure that in the event of a disaster or crisis, personnel will have a clear understanding of who should be contacted. Procedures have been addressed to ensure that communications can be quickly established while activating disaster recovery.
The DR plan will rely principally on key members of management and staff who will provide the technical and management skills necessary to achieve a smooth technology and business recovery.
2.3.1 Emergency Alert
The person discovering the incident calls a member of the Emergency Response Team in the order listed in the Key Personnel Contact Info table above; if the first person is not available, try the next contact on the list.
The Emergency Response Team (ERT) is responsible for activating the DRP for disasters identified in this plan, as well as in the event of any other occurrence that affects the company’s capability to perform normally.
One of the tasks during the early stages of the emergency is to notify the Disaster Recovery Team (DRT) that an emergency has occurred. The Business Recovery Team (BRT) will consist of senior representatives from the main business departments. The BRT Leader will be a senior member of the company's management team and will be responsible for taking overall charge of the process and ensuring that the company returns to normal working operations as early as possible.
Business Recovery Team:
Name | Position | Phone | Location
---|---|---|---
Yann Teboul | Chief Customer Officer | +64 21 363 636 | Auckland, New Zealand
Daniel Alexander | Chief Strategy & Product | +64 27 207 0858 | Auckland, New Zealand
Raphael Santos | Chief Technology Officer | +64 21 207 4880 | Auckland, New Zealand
Manuel Seidel | Chief Executive Officer (BRT Leader) | +64 27 352 8440 | Auckland, New Zealand
2.3.2 Contact with Authorities and Notification of Customers
In the case of an Information Security Incident, members of the management team should notify the authorities relevant to the nature of the incident, including the police, fire service, CERT NZ and the Office of the Australian Information Commissioner (for notifiable data breaches).
Should the incident affect customer data, all impacted customers should be notified and kept informed of the severity of the incident, its scope and the progress being made to restore service.
2.3.3 DR Procedures for Management
Members of the management team will keep a hard copy of the names and contact numbers of each employee in their departments. In addition, management team members will have a hard copy of the company’s disaster recovery and business continuity plans on file in their homes in the event that the headquarters building is inaccessible, unusable, or destroyed.
2.3.4 Contact with Employees
Managers will serve as the focal points for their departments, while designated employees will call other employees to discuss the crisis/disaster and the company’s immediate plans. Employees who cannot reach staff on their call list are advised to call the staff member’s emergency contact to relay information on the disaster.
2.3.5 Backup Staff
If a manager or staff member designated to contact other staff members is unavailable or incapacitated, the designated backup staff member will perform notification duties.
2.3.6 Alternate Work Facilities
If the ecoPortal office becomes unavailable due to a disaster, all staff shall work remotely from their homes or any safe location.
2.3.7 Communication, Event Log and Situation Report
We use a Status Page tool to support our event logging, action tracking, communication, and situation reporting processes. The Status Page is our primary platform for maintaining detailed event and action logs, ensuring that all incidents and responses are documented and tracked in real time.
3 Recovery Objectives
3.1 Recovery Time Objective (RTO)
In our commitment to maintaining optimal business operations, we have established a Recovery Time Objective (RTO) of 4 hours, representing the maximum allowable downtime in the event of a disruption. This RTO is a critical component of our business continuity plan, outlining the targeted duration within which our processes and systems must be restored to avoid any adverse impact on our operations.
A significant aspect of our recovery strategy involves acknowledging our dependency on third-party providers. These external entities play a pivotal role in the overall resilience of our systems. As part of our risk mitigation efforts, we have strategically chosen leading vendors renowned for their reliability and robust infrastructure: Amazon Web Services (AWS) for cloud services and MongoDB Atlas for database management.
3.2 Recovery Point Objective (RPO)
The Recovery Point Objective (RPO) signifies the acceptable amount of data loss that ecoPortal can tolerate in the event of a disruption. With an RPO of 2 hours, we are committed to preserving data integrity by ensuring that, at most, only the last 2 hours of data would be lost in the event of an incident. This aligns with best practices for minimizing data loss and maintaining business continuity.
Implementing point-in-time recovery gives us the ability to restore data to any defined moment within the last 7 days. This flexibility is a strategic decision that recognizes the varying nature and timing of potential data incidents. By allowing recovery to specific points in time, we gain precision in restoring data, mitigating the impact of errors, data corruption, or unintended changes.
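For illustration, the following is a minimal sketch of validating a requested restore target against this 7-day point-in-time window before attempting a restore. The window and RPO values come from this section; the function and its usage are hypothetical, not part of any production tooling.

```python
from datetime import datetime, timedelta, timezone

PIT_WINDOW = timedelta(days=7)   # point-in-time restore window (section 3.2)
RPO = timedelta(hours=2)         # maximum tolerated data loss (section 3.2)

def validate_restore_target(target: datetime, now: datetime | None = None) -> None:
    """Reject restore targets outside the supported point-in-time window."""
    now = now or datetime.now(timezone.utc)
    if target > now:
        raise ValueError("Restore target is in the future")
    if now - target > PIT_WINDOW:
        raise ValueError(f"Restore target is older than the {PIT_WINDOW.days}-day window")

# Example: restoring to five minutes before a hypothetical corruption incident.
incident = datetime.now(timezone.utc) - timedelta(hours=3)
validate_restore_target(incident - timedelta(minutes=5))
```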
4 Employee Training and Awareness
Employees receive training during the implementation of the test strategies. This ensures they are well-informed and equipped to effectively execute and understand the testing processes.
Appendix A – Technology Disaster Recovery Plan of Core ecoPortal Systems
Type | Purpose | Provider
---|---|---
Processes | ecoPortal web, workers | AWS ECS

Overview: ecoPortal uses Amazon's Elastic Container Service to host its processes within the Sydney region.
Recovery Strategy 1: Containers in the cluster are distributed across at least two data centres. Web and worker processes balance themselves across the data centres; containers are automatically monitored, and faulty processes are automatically replaced. In the event of a data-centre outage, ecoPortal will continue to function out of the other data centre, initially at half capacity, and will automatically redeploy the remaining capacity to the functioning data centre.
Recovery Strategy 2: In the event of severe service degradation or prolonged downtime across the Sydney data centres, we could transfer ecoPortal to other data centres in Australia. ecoPortal runs on Docker, which makes it fairly platform agnostic. This scenario is extremely unlikely, however, as AWS is the gold standard in data centres; a downtime event of this level would affect a large portion of the internet in the APAC region, and a move like this would take at least a week of work.
Recovery Strategy 3: A faster option for recovering from a total failure of the Sydney region would be to deploy ecoPortal to an AWS data centre in another Australian region.
Tests: Strategy 3 is tested every 12 months (commencing 2025).
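As an illustration of how Recovery Strategy 1 can be verified during an incident, the following minimal sketch compares running versus desired task counts in ECS. The cluster and service names are assumptions for the example, not our production configuration.

```python
import boto3

CLUSTER = "ecoportal-production"   # hypothetical cluster name
SERVICES = ["web", "workers"]      # hypothetical service names

def check_capacity() -> None:
    """Report whether ECS services are running at their desired capacity."""
    ecs = boto3.client("ecs", region_name="ap-southeast-2")  # Sydney region
    resp = ecs.describe_services(cluster=CLUSTER, services=SERVICES)
    for svc in resp["services"]:
        running, desired = svc["runningCount"], svc["desiredCount"]
        print(f"{svc['serviceName']}: {running}/{desired} tasks running")
        if running < desired:
            # ECS replaces faulty tasks automatically; a persistent gap
            # here means recovery is still in progress or has stalled.
            print("  -> degraded; ECS should be redeploying capacity")

if __name__ == "__main__":
    check_capacity()
```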
Database | All client data | MongoDB Atlas

Overview: MongoDB Atlas database holding all client data.
Backup: Client data is actively distributed across multiple availability zones (data centres). Backups are redundant across two availability zones and continuous (within a couple of seconds).
Additionally, we make a dump backup every day and store it in an S3 bucket, keeping the last 14 backups.
Recovery Strategy 1: In the event of a zone going down, the database will seamlessly fail over to an active secondary with no service interruption.
Recovery Strategy 2: In the event of data loss, we use point-in-time backup and restore. We frequently test the restoration process on our staging environments.
Tests: Both strategies are tested every 6 months.
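The daily dump-and-rotate job described above could look roughly like the following sketch. The bucket name, key prefix and connection-string environment variable are illustrative assumptions, not the production configuration.

```python
import os
import subprocess
from datetime import datetime, timezone

import boto3

BUCKET = "ecoportal-db-backups"   # hypothetical bucket name
PREFIX = "daily-dumps/"           # hypothetical key prefix
KEEP = 14                         # retain the last 14 dumps (per this plan)

def dump_and_rotate() -> None:
    """Take a gzipped mongodump, upload it to S3, and prune old dumps."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    local = f"/tmp/mongodump-{stamp}.archive.gz"
    # mongodump --archive writes a single gzipped archive of the database.
    subprocess.run(
        ["mongodump", f"--uri={os.environ['MONGODB_URI']}",
         f"--archive={local}", "--gzip"],
        check=True,
    )
    s3 = boto3.client("s3")
    s3.upload_file(local, BUCKET, f"{PREFIX}{stamp}.archive.gz")
    # Prune everything beyond the newest KEEP dumps.
    objs = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    objs.sort(key=lambda o: o["LastModified"], reverse=True)
    for obj in objs[KEEP:]:
        s3.delete_object(Bucket=BUCKET, Key=obj["Key"])

if __name__ == "__main__":
    dump_and_rotate()
```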
Database | Notification read/unread/email status | Amazon ElastiCache

Overview: Data stored in ElastiCache is mostly basic metadata around queuing emails and the notification tray, and as such is non-critical. The system could operate without it, experiencing only a temporary loss of notifications.
Backup: This data is actively distributed across two servers and backed up to disk should a server crash. Backups are taken daily, stored on S3 and replicated across two continents.
Recovery Strategy: Should one data centre or server fail, the service will fail over seamlessly to the other with no loss of data.
Tests: The strategy is tested every 6 months (commencing 2025).
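Because this cache is non-critical, application code can degrade gracefully when it is unavailable. A minimal sketch of that pattern, assuming a Redis-compatible cache and a placeholder endpoint:

```python
import redis

# Placeholder endpoint; the real ElastiCache hostname will differ.
cache = redis.Redis(host="notifications.cache.local", socket_timeout=1)

def read_status(key: str) -> str | None:
    """Return cached notification state, degrading gracefully if the cache is down."""
    try:
        value = cache.get(key)
        return value.decode() if value is not None else None
    except redis.RedisError:
        # Cache unavailable: notifications are temporarily lost,
        # but the rest of the service continues operating.
        return None
```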
Search | Search index | Amazon OpenSearch Service

Overview: Some client data is reprocessed and optimised for search, filtering and analytics. All data stored in OpenSearch is derived from data stored in MongoDB. Our cluster runs multiple active instances of OpenSearch in multiple data centres for failover and redundancy.
Recovery Strategy 1: Spin up a new OpenSearch cluster and trigger the indexation process using data stored in MongoDB.
Recovery Strategy 2: If more catastrophic problems occur, it is possible to rehost with a variety of service providers (e.g. Elastic Cloud) or on a directly managed instance, with only service degradation while ecoPortal reindexes content to the new location.
Tests: Strategy 1 is tested every 6 months. Strategy 2 will be tested every 24 months (commencing 2025).
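Recovery Strategy 1 amounts to streaming documents out of MongoDB and bulk-indexing them into a fresh cluster. A minimal sketch of that idea, with placeholder connection details, database, collection and index names:

```python
from opensearchpy import OpenSearch, helpers
from pymongo import MongoClient

# Placeholder connection details; production endpoints will differ.
mongo = MongoClient("mongodb://localhost:27017")
search = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def reindex(collection: str, index: str) -> None:
    """Stream every document from MongoDB into a fresh OpenSearch index."""
    docs = mongo["ecoportal"][collection].find()
    actions = (
        {"_index": index, "_id": str(doc.pop("_id")), "_source": doc}
        for doc in docs
    )
    # Bulk indexing is the standard way to repopulate an index efficiently.
    helpers.bulk(search, actions)

# Hypothetical collection and index names, for illustration only.
reindex("records", "records-search")
```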
Email | All transactional email, notifications | SendGrid

Overview: SendGrid handles sending our emails. It is heavily used industry-wide, processing over ten billion emails per month on behalf of a huge variety of organisations and individuals, big and small.
Recovery Strategy 1: In the highly unlikely event of service disruption, we are able to swap the mail server to a competitor or our own hosted SMTP server within minutes.
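A sketch of why the swap is fast: if the application reads its SMTP relay from the environment, pointing it at a fallback server is purely a configuration change. Hostnames, credentials and the sender address below are illustrative assumptions.

```python
import os
import smtplib
from email.message import EmailMessage

# The relay is selected via environment variables, so a failover swap
# (e.g. to a competitor or self-hosted SMTP server) needs no code change.
SMTP_HOST = os.environ.get("SMTP_HOST", "smtp.sendgrid.net")
SMTP_PORT = int(os.environ.get("SMTP_PORT", "587"))

def send(to: str, subject: str, body: str) -> None:
    """Send a plain-text email through whichever relay is configured."""
    msg = EmailMessage()
    msg["From"] = "notifications@ecoportal.co.nz"  # hypothetical sender address
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as smtp:
        smtp.starttls()
        smtp.login(os.environ["SMTP_USER"], os.environ["SMTP_PASS"])
        smtp.send_message(msg)
```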
DNS | Resolve domains | Cloudflare

Overview: Cloudflare's Service Level Agreement commits to 100% availability during a monthly billing cycle.
Recovery Strategy 1: Cloudflare has a fully redundant, geo-distributed architecture, so in the case of a disaster there is automatic redundancy.
Recovery Strategy 2: In the highly unlikely event of cessation of service, there are numerous similarly accredited competitors we can shift to.
SSL | Security layer for HTTP transport | AWS Certificate Manager

Overview: AWS is one of the largest providers of this service, so a certificate-related failure is very unlikely.
Recovery Strategy 1: Although failure is incredibly unlikely, if needed, certificates can be arranged and issued rapidly from DigiCert or other providers.