AuditRubric
pr-ir-3 high Protect / Technology Infrastructure Resilience

Mechanisms are implemented to achieve resilience requirements in normal and adverse situations

Resilience is the ability to absorb disruption and keep running, or to recover quickly when you cannot. Organizations that design only for the normal case discover their failure modes during an outage or attack. Building redundancy, failover capability, and graceful degradation into your architecture turns single points of failure from outages waiting to happen into managed risks.

Estimated effort: 8h
resilience, redundancy, failover, bcdr, high-availability, rto

Implementation steps

  1. Identify single points of failure in critical services

    For each service in your organization's critical path, map the components required for it to function and identify any single points of failure: a single database with no replica, a single load balancer, a single DNS provider, a single region. Each single point of failure is a risk that should be addressed, accepted with a documented recovery plan, or transferred.

    confluence, miro, lucidchart
  2. Implement redundancy for critical components

    Address each identified single point of failure in order of criticality: replicate databases, deploy critical services across multiple availability zones or regions, and place multiple load balancers behind DNS failover. Test failover by deliberately failing components in a non-production environment.

    aws, gcp, azure, cloudflare, datadog
  3. Define and test recovery time objectives

    For each critical service, define a recovery time objective (RTO): how long can the service be down before the impact becomes unacceptable? Design your resilience controls to meet that objective, then test them. Run a quarterly or annual disaster recovery test in which you simulate the loss of critical components and time the recovery. Use the results to identify gaps between your RTO target and the actual recovery time.

    aws, gcp, azure
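
The mapping exercise in step 1 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Component` and `Service` classes, the `single_points_of_failure` helper, and the `checkout` inventory are all hypothetical names invented for the example, and "fewer than two independent instances" is an assumed working definition of a single point of failure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    kind: str        # e.g. "database", "load_balancer", "dns_provider", "region"
    instances: int   # independent instances that can take over for one another


@dataclass
class Service:
    name: str
    components: list[Component] = field(default_factory=list)


def single_points_of_failure(service: Service) -> list[Component]:
    """Return components that have fewer than two independent instances."""
    return [c for c in service.components if c.instances < 2]


# Hypothetical inventory for one critical service (assumed values).
checkout = Service(
    name="checkout",
    components=[
        Component("orders-db", "database", instances=1),
        Component("edge-lb", "load_balancer", instances=2),
        Component("dns", "dns_provider", instances=1),
        Component("deploy-region", "region", instances=1),
    ],
)

for component in single_points_of_failure(checkout):
    print(f"{checkout.name}: {component.kind} '{component.name}' has no redundancy")
```

With this inventory the helper flags the database, the DNS provider, and the region, and leaves the load balancer alone. Each flagged entry then needs one of the decisions named in step 1: address it, accept it with a documented recovery plan, or transfer it.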

Evidence required

Architecture documentation showing resilience design

A current architecture diagram or documentation showing redundancy mechanisms for critical services.

  • Architecture diagram showing multi-AZ or multi-region deployment
  • Disaster recovery plan describing redundant components and failover procedures
  • RTO and RPO documentation for critical services

Resilience test records

Records of testing that confirms resilience mechanisms work as designed.

  • Failover test results from a disaster recovery exercise
  • Chaos engineering test results showing system behavior under component failure
  • Database failover test log showing successful promotion of replica
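
A test record is more useful as evidence when it states the measured recovery time next to the RTO target. The sketch below shows one way to capture that comparison; the `DrillResult` class, the service names, the RTO values, and the timestamps are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    service: str
    rto: timedelta
    failed_at: datetime      # when the component was deliberately failed
    recovered_at: datetime   # when the service was confirmed healthy again

    @property
    def recovery_time(self) -> timedelta:
        return self.recovered_at - self.failed_at

    @property
    def rto_gap(self) -> timedelta:
        """Positive when recovery took longer than the RTO."""
        return self.recovery_time - self.rto


# Hypothetical drill results (assumed values, not real measurements).
drills = [
    DrillResult("checkout", timedelta(minutes=15),
                datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 22)),
    DrillResult("search", timedelta(minutes=60),
                datetime(2024, 1, 10, 11, 0), datetime(2024, 1, 10, 11, 40)),
]

for drill in drills:
    status = "MISSED" if drill.rto_gap > timedelta(0) else "met"
    recovered_min = drill.recovery_time.total_seconds() / 60
    rto_min = drill.rto.total_seconds() / 60
    print(f"{drill.service}: recovered in {recovered_min:.0f} min, "
          f"RTO {rto_min:.0f} min, {status}")
```

A positive `rto_gap` marks a service whose resilience controls did not meet the objective and is the kind of gap the disaster recovery test is meant to surface.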

Related controls