Mechanisms are implemented to achieve resilience requirements in normal and adverse situations
Resilience is the ability to absorb disruption and keep running, or to recover quickly when that is not possible. Organizations that design only for the normal case discover their failure modes during an outage or attack. Building redundancy, failover capability, and graceful degradation into your architecture turns single points of failure from inevitabilities into managed risks.
Implementation steps
1. Identify single points of failure in critical services
For each service in your organization's critical path, map the components required for it to function and identify any single points of failure: a single database with no replica, a single load balancer, a single DNS provider, a single region. Each single point of failure is a risk that should be addressed, accepted with a documented recovery plan, or transferred.
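The mapping exercise described above can be sketched as a small inventory check. This is a minimal illustration, assuming a hand-maintained inventory in which each component records how many independent instances it has. The service names, component names, and the `find_single_points_of_failure` helper are hypothetical and not part of any tool named here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One dependency of a critical service (hypothetical inventory entry)."""
    name: str
    instances: int  # independent instances: replicas, providers, regions, etc.


def find_single_points_of_failure(
    inventory: dict[str, list[Component]],
) -> dict[str, list[str]]:
    """Return, per service, the components that have fewer than two instances."""
    findings: dict[str, list[str]] = {}
    for service, components in inventory.items():
        spofs = [c.name for c in components if c.instances < 2]
        if spofs:
            findings[service] = spofs
    return findings


# Illustrative data only; replace with your own dependency map.
inventory = {
    "checkout": [
        Component("database", 1),
        Component("load balancer", 2),
        Component("DNS provider", 1),
    ],
    "search": [
        Component("index cluster", 3),
        Component("region", 2),
    ],
}

spof_report = find_single_points_of_failure(inventory)
for service, names in spof_report.items():
    print(f"{service}: single points of failure -> {', '.join(names)}")
```

Each flagged component still needs a decision recorded against it: address it, accept it with a documented recovery plan, or transfer it.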
Tools: Confluence, Miro, Lucidchart
2. Implement redundancy for critical components
Address the identified single points of failure in order of criticality: deploy database replication, run critical services across multiple availability zones or regions, and place redundant load balancers behind DNS failover. Then test failover capability by deliberately failing components in a non-production environment.
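Before running a live failover test, it can help to check on paper whether a deployment would survive the loss of one availability zone. The sketch below is a simplified model, not a call into any cloud provider's API. The `Instance` type, the `zone_loss_drill` helper, and the instance and zone names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One deployed copy of a component (hypothetical model)."""
    name: str
    zone: str


def zone_loss_drill(instances: list[Instance]) -> dict[str, bool]:
    """For each zone in use, report whether any instance survives losing it."""
    zones = sorted({i.zone for i in instances})
    return {zone: any(i.zone != zone for i in instances) for zone in zones}


# Illustrative layouts only.
database = [Instance("db-primary", "az-a"), Instance("db-replica", "az-b")]
load_balancer = [Instance("lb-1", "az-a"), Instance("lb-2", "az-a")]

db_result = zone_loss_drill(database)
lb_result = zone_loss_drill(load_balancer)

print("database:", db_result)
print("load balancer:", lb_result)
```

In this example the two load balancers are redundant against an instance failure but share a zone, so the drill reports that losing `az-a` leaves none running. A check like this complements a real failover test and does not replace it.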
Tools: AWS, GCP, Azure, Cloudflare, Datadog
3. Define and test recovery time objectives
For each critical service, define a recovery time objective (RTO): the maximum time the service can acceptably be down. Design your resilience controls to meet that objective, then test them. Run a quarterly or annual disaster recovery test in which you simulate the loss of critical components and time the recovery. Use the results to identify gaps between your RTO and the actual recovery time.
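The comparison between target and measured recovery time can be sketched as follows. This is a minimal example that assumes you record one measured recovery time per service during each test. The service names, durations, and the `rto_gaps` helper are hypothetical.

```python
from datetime import timedelta


def rto_gaps(
    targets: dict[str, timedelta],
    measured: dict[str, timedelta],
) -> dict[str, timedelta]:
    """Return services whose measured recovery exceeded the RTO, with the overrun."""
    gaps: dict[str, timedelta] = {}
    for service, target in targets.items():
        actual = measured.get(service)
        if actual is None:
            continue  # not exercised in this test; track untested services separately
        if actual > target:
            gaps[service] = actual - target
    return gaps


# Illustrative targets and test results only.
rto_targets = {
    "checkout": timedelta(minutes=30),
    "search": timedelta(hours=4),
}
drill_results = {
    "checkout": timedelta(minutes=48),
    "search": timedelta(hours=1),
}

gaps = rto_gaps(rto_targets, drill_results)
for service, overrun in gaps.items():
    print(f"{service}: exceeded RTO by {overrun}")
```

Keeping the targets and the timed results from each exercise in a comparable form also makes them easier to present as resilience test records.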
Tools: AWS, GCP, Azure
Evidence required
Architecture documentation showing resilience design
A current architecture diagram or documentation showing redundancy mechanisms for critical services.
- Architecture diagram showing multi-AZ or multi-region deployment
- Disaster recovery plan describing redundant components and failover procedures
- RTO and RPO documentation for critical services
Resilience test records
Records of testing that confirms resilience mechanisms work as designed.
- Failover test results from a disaster recovery exercise
- Chaos engineering test results showing system behavior under component failure
- Database failover test log showing successful promotion of a replica
Related controls
- Technology assets are protected from environmental threats (Technology Infrastructure Resilience)
- Backups of data are created, protected, maintained, and tested (Data Security)
- Networks and environments are protected from unauthorized logical access (Technology Infrastructure Resilience)
- Adequate resource capacity to ensure availability is maintained (Technology Infrastructure Resilience)