Backup and Disaster Recovery Design
This page explains the basics of planning backups and recovery as part of system design.
Why It Matters
Systems fail in different ways: accidental deletion, storage corruption, region outages, bad releases, or operator mistakes. Backup and recovery planning limits how long such a failure interrupts service and how much data is lost.
Core Concepts
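- Backup frequency: how often a backup is taken.
- Retention period: how long each backup is kept before it is deleted.
- Recovery time objective (RTO): the maximum acceptable time to restore service after a failure.
- Recovery point objective (RPO): the maximum acceptable amount of data loss, measured as a span of time before the failure.
- Restore testing: restoring from a backup on purpose to confirm that both the backup and the procedure work.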
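As a rough illustration of how backup frequency relates to the RPO, the sketch below estimates worst-case data loss for periodic backups. It assumes each backup captures the state at the moment it starts and is only usable once it finishes; the function names are hypothetical and not taken from any tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Upper bound on lost data for periodic backups.

    Assumption: a backup reflects the state at its start time and cannot be
    used until it completes. A failure just before the next backup completes
    therefore loses one full interval plus the time the backup takes to run.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              backup_duration: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case loss stays within the recovery point objective."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo
```

Under these assumptions, hourly backups that take ten minutes to run fit a two-hour RPO, while nightly backups do not fit a four-hour RPO.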
Basic Design Approach
- Identify critical data and services.
- Define acceptable downtime (RTO) and acceptable data loss (RPO).
- Choose backup frequency and storage location.
- Document restore procedures.
- Test recovery regularly.
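The steps above could be recorded as a small policy object per service and checked automatically. This is a minimal sketch, not a standard schema: `BackupPolicy`, its fields, and `policy_problems` are hypothetical names, and comparing locations as plain strings is a simplification of what a real failure-domain check would need.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class BackupPolicy:
    service: str                      # step 1: the critical service or data set
    rto: timedelta                    # step 2: acceptable downtime
    rpo: timedelta                    # step 2: acceptable data loss
    backup_interval: timedelta        # step 3: backup frequency
    primary_location: str             # step 3: where the live data is stored
    backup_location: str              # step 3: where backups are stored
    runbook: str                      # step 4: path or link to restore steps
    restore_test_interval: timedelta  # step 5: how often restores are tested


def policy_problems(policy: BackupPolicy) -> list[str]:
    """Return human-readable problems found in a policy (empty if none)."""
    problems = []
    if policy.backup_interval > policy.rpo:
        problems.append("backup interval is longer than the RPO")
    if policy.backup_location == policy.primary_location:
        problems.append("backups share a location with the primary data")
    if not policy.runbook.strip():
        problems.append("no restore procedure is documented")
    if policy.restore_test_interval <= timedelta(0):
        problems.append("no restore test schedule is defined")
    return problems
```

A policy with six-hourly backups, a one-hour RPO, identical primary and backup locations, and an empty runbook would be reported with three problems.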
Common Risks
- Backups exist but restores were never tested
- Backups stored in the same failure domain
- No one knows the real recovery steps
- Databases and application state are backed up on different schedules or by different methods, so a restore can leave them out of sync
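The first risk, untested restores, can be made visible with a simple staleness check. The sketch below assumes the team records the time of the last successful restore test for each service; `restore_test_is_stale` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def restore_test_is_stale(last_tested: Optional[datetime],
                          max_age: timedelta,
                          now: datetime) -> bool:
    """True if a restore was never tested or the last test is too old.

    `last_tested` is None when no restore test has ever been recorded.
    `now` is passed in explicitly so the check is easy to test.
    """
    if last_tested is None:
        return True
    return now - last_tested > max_age
```

Treating a missing record as stale is a deliberate choice here: a backup with no recorded restore test is handled the same way as one whose test has expired.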
Practical Advice
- Test restore procedures regularly
- Keep backup ownership clear
- Store backups away from the primary failure domain
- Write recovery steps for the team, not just for one operator
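One way to make a restore test produce a clear pass or fail result is to compare restored files against checksums recorded when the backup was taken. This is a sketch under that assumption: the manifest format (relative path mapped to a SHA-256 hex digest) and the function names are hypothetical, and it only covers file-level backups.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restored_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or differ from the manifest.

    `manifest` maps each relative file path to the SHA-256 hex digest that
    was recorded at backup time. An empty result means every file matched.
    """
    failures = []
    for relative_path, expected in manifest.items():
        restored = restored_dir / relative_path
        if not restored.is_file() or sha256_of(restored) != expected:
            failures.append(relative_path)
    return failures
```

Because the check returns the list of failing paths, whoever runs the test can see what went wrong without knowing how the backup was produced.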