a1-3 Critical priority Availability / Availability

Recovery procedures restore system availability after disruptions

Even the most resilient systems can be disrupted. Availability commitments require not just preventing downtime, but recovering from it within defined timeframes. This criterion requires that backup and recovery procedures are in place, tested, and capable of restoring service within the recovery time objectives committed to customers.

12h estimated effort

availability backup recovery rto rpo disaster-recovery

Complete first: a1-2

Implementation steps

1

Define and document recovery objectives (RTO and RPO)

Recovery Time Objective (RTO) is the maximum time you allow for restoring service after a disruption. Recovery Point Objective (RPO) is the maximum acceptable data loss, expressed as a window of time before the disruption. Document both objectives for each critical system. Your backup frequency and replication strategy must be capable of meeting your RPO, and your incident response and recovery procedures must be capable of meeting your RTO.
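One way to keep the documented objectives checkable is to record them as data and compare them with the backup schedule and measured recovery times. The sketch below is illustrative only: the `RecoveryObjective` type, the helper names, and the `orders-db` values are hypothetical. It assumes worst-case data loss is roughly one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Documented recovery targets for one critical system (hypothetical type)."""
    system: str
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


def backup_interval_meets_rpo(objective: RecoveryObjective,
                              backup_interval: timedelta) -> bool:
    # Assumption: a failure just before the next backup loses about one
    # full interval of data, so the interval must not exceed the RPO.
    return backup_interval <= objective.rpo


def recovery_meets_rto(objective: RecoveryObjective,
                       measured_recovery_time: timedelta) -> bool:
    # Compare a recovery time measured in a test against the documented RTO.
    return measured_recovery_time <= objective.rto


# Hypothetical example values, not recommendations.
orders_db = RecoveryObjective("orders-db",
                              rto=timedelta(hours=4),
                              rpo=timedelta(hours=1))
```

With these example values, a daily backup schedule would not satisfy the one-hour RPO, while a 15-minute schedule would.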

confluence notion google-docs

2

Implement automated backups for all critical data

Configure automated backups for all databases and data stores that contain customer data or data required for service operation. Backups should run at least daily, and more frequently for systems with a short RPO. Store backups in a separate location from production (a separate region or account). Retain backups according to your documented retention policy and any regulatory requirements.
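The two requirements above that are easiest to check automatically are backup freshness and storage outside production. The sketch below is provider-agnostic and makes no backup service API calls. It assumes you can export a list of completed backups with a timestamp and a region; the `BackupRecord` type and `audit_backups` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BackupRecord:
    """One completed backup, as exported from your backup tooling (hypothetical)."""
    system: str
    completed_at: datetime
    region: str


def audit_backups(records, production_region, max_age, now):
    """Return a list of findings; an empty list means no issue was detected."""
    if not records:
        return ["no backups found"]
    findings = []
    # Freshness: the newest backup must be younger than the allowed age.
    newest = max(r.completed_at for r in records)
    if now - newest > max_age:
        findings.append("newest backup is older than the allowed maximum age")
    # Separation: at least one copy must live outside the production region.
    if all(r.region == production_region for r in records):
        findings.append("no backup copy is stored outside the production region")
    return findings


# Hypothetical usage: region names and the 24-hour limit are example values.
example_now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
example_records = [
    BackupRecord("orders-db", example_now - timedelta(hours=2), "eu-west-1"),
]
example_findings = audit_backups(
    example_records, "us-east-1", timedelta(hours=24), example_now
)
```

Running a check like this on a schedule turns a missed or misplaced backup into an alert rather than a surprise during recovery.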

aws-backup aws-rds google-cloud-backup azure-backup

3

Test recovery procedures at least annually

Backups and recovery procedures that have never been tested often fail when they are needed most. At least annually, restore from backup in a non-production environment and verify data integrity and completeness. Document the test results, including the actual recovery time, and compare that time against your documented RTO. Use the test to validate or update your RTO/RPO targets.
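A restore test produces three facts worth recording: how long recovery took, whether the restored data is intact, and whether the time fits the RTO. The sketch below is a minimal illustration. It assumes the restore step can be wrapped in a callable that returns the restored bytes, and that a SHA-256 checksum taken at backup time is available for comparison. All names are hypothetical.

```python
import hashlib
import time
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable


@dataclass(frozen=True)
class RestoreTestResult:
    """Outcome of one restore test, suitable for the test record (hypothetical)."""
    recovery_time: timedelta
    integrity_ok: bool
    within_rto: bool


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_restore_test(restore: Callable[[], bytes],
                     expected_sha256: str,
                     rto: timedelta) -> RestoreTestResult:
    # Time the restore with a monotonic clock so wall-clock changes
    # during the test do not distort the measurement.
    started = time.monotonic()
    restored = restore()
    elapsed = timedelta(seconds=time.monotonic() - started)
    return RestoreTestResult(
        recovery_time=elapsed,
        integrity_ok=sha256_of(restored) == expected_sha256,
        within_rto=elapsed <= rto,
    )


# Hypothetical usage with in-memory data standing in for a real restore.
original = b"customer-records-snapshot"
result = run_restore_test(lambda: original, sha256_of(original), timedelta(hours=4))
```

A checksum only shows that the restored bytes match what was backed up. Completeness checks such as row counts or application-level queries still need to be run separately.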

aws-backup aws-rds confluence notion