Resiliency Overview
Last updated:
AZURERESILIENCY
What are we protecting against
- Software failures (app, OS, etc.) - Replication
- Hardware failures - Replication
- Corruption - Backup/Snapshot
- Attack/DoS - Isolated backups
- Regulatory requirements - Backup
- Humans - Processes
Protection from Infrastructure Failures
- [[202404071441 Replication|Replication]]
- Stateless systems, such as web frontends, can run from different places.
- For stateful systems, asynchronously copy the data to a different region.
- [[202404071559 Azure Backup|Backup]]
- Point-in-time copies
- For stateful systems, a point-in-time copy does not protect against hardware failure when it is stored on the same disk as the data. It does let us revert after a logical issue such as corruption.
- [[202408041224 Azure Monitoring|Monitoring]]
- Shows how things are being used, so that we know when they are not working as they should
Protection from human errors
- Minimize human contact with production systems
- Orchestration tools should be human-proof
- Automated deployments from version control systems. Automated testing.
Understand services and dependent services
In short, understand the architecture.
- Which services are critical for my business? These must be protected. Where is the state? That is the critical part.
- Stateless things can be recreated.
- In the example below, the state is at the DB level, so the DB is critical. The rest can be recreated.
flowchart LR
subgraph stateless
LB1 --> WEB1 & WEB2 & WEB3 --> LB2 --> APP1 & APP2
end
subgraph state
APP1 & APP2 --> DB -.-> replicaDB
end
- Which services do these critical services depend on? These must be protected too.
- Everything else is nice to have.
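The questions above can be sketched as a small graph walk: start from a critical service, follow its dependencies, and separate the stateful ones (which need protection) from the stateless ones (which can be recreated). This is only an illustration. The `DEPENDS_ON` map, the `STATEFUL` set, and the `must_protect` helper are hypothetical names, with edges copied from the flowchart above; the dotted replication edge is treated as an ordinary dependency for simplicity.

```python
from collections import deque

# Hypothetical dependency map copied from the flowchart above:
# each service lists the services it points to.
DEPENDS_ON = {
    "LB1": ["WEB1", "WEB2", "WEB3"],
    "WEB1": ["LB2"],
    "WEB2": ["LB2"],
    "WEB3": ["LB2"],
    "LB2": ["APP1", "APP2"],
    "APP1": ["DB"],
    "APP2": ["DB"],
    "DB": ["replicaDB"],  # assumption: replication edge treated as a dependency
    "replicaDB": [],
}

# Assumption: only the database tier holds state in this example.
STATEFUL = {"DB", "replicaDB"}


def must_protect(critical, depends_on):
    """Return the critical services plus everything they transitively depend on."""
    seen = set(critical)
    queue = deque(critical)
    while queue:
        service = queue.popleft()
        for dep in depends_on.get(service, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


if __name__ == "__main__":
    closure = must_protect({"APP1"}, DEPENDS_ON)
    print("protect state of:", sorted(closure & STATEFUL))
    print("can be recreated:", sorted(closure - STATEFUL))
```

Anything in the closure that is not stateful is a candidate for recreation from automation rather than backup.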
Understand requirements for availability and architect accordingly
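- Architect to meet or exceed the agreed SLA
- Depending on the criticality of a service, it will have a specific availability requirement
- Balance the price of improving the SLA against the cost of downtime
- For the overall SLA, work out whether components have an AND relationship (all must be up, so the combined availability is lower than any single one) or an OR relationship (any one is enough, so redundancy raises the combined availability)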
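The AND/OR distinction can be made concrete with a little arithmetic. The sketch below assumes component failures are independent, which real systems only approximate, and the availability figures used in the demo are made-up examples rather than published SLAs. The function names are hypothetical.

```python
from math import prod


def and_availability(availabilities):
    """AND relationship: every component must be up, so availabilities multiply."""
    return prod(availabilities)


def or_availability(availabilities):
    """OR relationship: the system is up if at least one component is up."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Expected downtime in minutes over a 365-day year."""
    return (1 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    web, db = 0.999, 0.9995  # illustrative numbers only
    chain = and_availability([web, db])      # web AND db must both work
    redundant = or_availability([web, web])  # either of two web tiers works
    print(f"AND: {chain:.6f}, about {downtime_minutes_per_year(chain):.0f} min/year down")
    print(f"OR:  {redundant:.6f}, about {downtime_minutes_per_year(redundant):.1f} min/year down")
```

Converting the result to minutes of downtime per year makes it easier to compare against the cost of downtime when deciding whether extra redundancy is worth paying for.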
Testing
- Application testing (testing with different types of input data for example)
- Load testing (testing with a large number of users)
- Deployment process testing (How you deploy)
- Failover testing
- Restore testing
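As a rough illustration of restore testing, the sketch below backs up a file, restores it to a separate location, and checks that the restored content matches the source by comparing SHA-256 hashes. The `restore_test` helper is hypothetical, and the plain file copies are stand-ins for whatever backup and restore mechanism is actually in use.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def restore_test(source, backup_dir, restore_dir):
    """Back up `source`, restore it elsewhere, and verify the contents match."""
    source = Path(source)
    backup = Path(backup_dir) / source.name
    shutil.copy2(source, backup)      # stand-in for the real backup step
    restored = Path(restore_dir) / source.name
    shutil.copy2(backup, restored)    # stand-in for the real restore step
    return sha256_of(source) == sha256_of(restored)
```

The point of the check is that a backup only counts once a restore from it has been shown to work.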