a1-2 High priority Availability / Availability

Environmental and infrastructure protections support system availability

Physical and environmental risks such as power failures, hardware failures, natural disasters, and network outages can take systems offline regardless of software controls. For organizations operating their own infrastructure, addressing these risks means physical protections. For cloud-native organizations, it means architectural choices: redundancy, multi-availability-zone deployments, and managed services with built-in resilience.

16h estimated effort

availability redundancy multi-az resilience

Complete first: a1-1

Implementation steps

1

Deploy across multiple availability zones

Architect your infrastructure to run across multiple availability zones (AZs) within your primary region, so that the failure of a single AZ does not cause a service outage. Use load balancers that distribute traffic across AZs. Ensure databases have read replicas or standby instances in a secondary AZ. Document your multi-AZ architecture.
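As a rough way to check the "an AZ failure should not cause a service outage" goal, the sketch below tests whether an instance inventory would still have running capacity after losing any one AZ. The inventory format (`id`, `az`, `state` fields) and the function names are hypothetical; in practice the list would come from your cloud provider's API or from Terraform state.

```python
from collections import Counter


def az_spread(instances):
    """Count running instances per availability zone."""
    return Counter(i["az"] for i in instances if i.get("state") == "running")


def survives_az_loss(instances, min_healthy=1):
    """Return True if losing any single AZ still leaves min_healthy running instances."""
    spread = az_spread(instances)
    if not spread:
        return False
    total = sum(spread.values())
    return all(total - count >= min_healthy for count in spread.values())


# Hypothetical inventory, for illustration only.
inventory = [
    {"id": "web-1", "az": "us-east-1a", "state": "running"},
    {"id": "web-2", "az": "us-east-1b", "state": "running"},
    {"id": "web-3", "az": "us-east-1b", "state": "stopped"},
]

print(az_spread(inventory))
print(survives_az_loss(inventory))
print(survives_az_loss(inventory, min_healthy=2))
```

A check like this only covers instance placement. It does not confirm that load balancers or databases fail over correctly, which still needs to be tested separately.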

aws google-cloud azure terraform

2

Implement redundancy for critical components

Identify single points of failure in your architecture and eliminate them. Critical components such as load balancers, application servers, databases, cache layers, and message queues should have redundant instances. Use managed services where possible (RDS Multi-AZ, ElastiCache with replication) rather than managing redundancy manually.
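One simple way to start the single-point-of-failure review is to list each critical component with its instances and flag any that lack a redundant copy in a second AZ. The sketch below assumes a hypothetical hand-maintained mapping of component name to instance placements; the names and structure are illustrative, not a provider API.

```python
def single_points_of_failure(components):
    """Return (component, reason) pairs for components lacking redundancy."""
    findings = []
    for name, replicas in sorted(components.items()):
        zones = {r["az"] for r in replicas}
        if len(replicas) < 2:
            findings.append((name, "fewer than two instances"))
        elif len(zones) < 2:
            findings.append((name, "all instances in one AZ"))
    return findings


# Hypothetical architecture inventory, for illustration only.
architecture = {
    "load-balancer": [{"az": "us-east-1a"}, {"az": "us-east-1b"}],
    "app-server": [{"az": "us-east-1a"}, {"az": "us-east-1a"}],
    "database": [{"az": "us-east-1a"}, {"az": "us-east-1b"}],
    "cache": [{"az": "us-east-1a"}],
    "message-queue": [{"az": "us-east-1a"}, {"az": "us-east-1b"}],
}

for component, reason in single_points_of_failure(architecture):
    print(f"{component}: {reason}")
```

The output of a review like this can be attached to the architecture documentation as a record of which gaps were found and closed.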

aws-rds aws-elasticache aws-elb cloudflare

3

Test and document recovery from infrastructure failures

Conduct chaos engineering exercises or infrastructure failure simulations to verify that redundancy works as expected. Test AZ failover, database failover, and application server replacement. Document the results and any gaps. Finding a gap during a planned test is far less costly than finding it during an actual failure.
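To turn a failover test into a documentable result, one option is to probe a health endpoint throughout the exercise and compute the longest observed outage afterwards. The sketch below works on already-collected samples of `(seconds_since_start, healthy)`; the sample data, the 60-second target, and the function names are assumptions for illustration, not values taken from this control.

```python
def longest_outage(samples):
    """Longest span in seconds from a first failed probe to the next healthy probe.

    samples: list of (t_seconds, healthy) tuples sorted by time.
    """
    longest = 0.0
    outage_start = None
    for t, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = t
        elif healthy and outage_start is not None:
            longest = max(longest, t - outage_start)
            outage_start = None
    # An outage still open at the end of the test counts up to the last sample.
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


def failover_result(samples, target_seconds):
    """Summarize a failover test against a recovery target (hypothetical format)."""
    outage = longest_outage(samples)
    return {
        "longest_outage_s": outage,
        "target_s": target_seconds,
        "passed": outage <= target_seconds,
    }


# Hypothetical probe results from a database failover test, one probe every 5 s.
probes = [(0, True), (5, True), (10, False), (15, False),
          (20, False), (25, True), (30, True)]

print(failover_result(probes, target_seconds=60))
```

Because probes are sampled, the measured outage is accurate only to about one probe interval; a shorter interval gives a tighter figure for the test record.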

aws-fault-injection-simulator chaos-monkey terraform

Evidence required

Multi-AZ architecture documentation

Evidence that infrastructure is deployed with redundancy across failure domains.

- Architecture diagram showing multi-AZ deployment
- AWS RDS Multi-AZ configuration
- Terraform code showing resources spread across availability zones

Infrastructure resilience testing records

Evidence that redundancy has been tested.

- Failover test results showing successful AZ recovery
- Chaos engineering exercise report
- Database failover test documentation

Related controls

a1-1

Capacity is managed to ensure system availability

Availability

high

a1-3

Recovery procedures restore system availability after disruptions

Availability

critical