Autoscaling and Disaster Recovery
This chapter covers autoscaling strategies and disaster recovery for a Kubernetes cluster, including HPA, VPA, the Cluster Autoscaler, KEDA, and backup/restore with Velero.
Autoscaling Architecture
Three-Tier Scaling Strategy
┌─────────────────────────────────────────────────────┐
│ Application tier (Horizontal Pod Autoscaler)        │
│ ┌──────────────────────────────────────────────┐    │
│ │ Adjusts the Pod replica count from metrics   │    │
│ │ ├─ CPU utilization                           │    │
│ │ ├─ Memory utilization                        │    │
│ │ ├─ Custom metrics (QPS, queue length, etc.)  │    │
│ │ └─ External metrics (SQS, CloudWatch)        │    │
│ └──────────────────────────────────────────────┘    │
└───────────────────┬─────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│ Pod resource tuning (Vertical Pod Autoscaler)       │
│ ┌──────────────────────────────────────────────┐    │
│ │ Adjusts Pod requests and limits              │    │
│ │ ├─ Analyzes historical resource usage        │    │
│ │ ├─ Recommends right-sized resources          │    │
│ │ └─ Applies updates automatically (optional)  │    │
│ └──────────────────────────────────────────────┘    │
└───────────────────┬─────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│ Node tier (Cluster Autoscaler)                      │
│ ┌──────────────────────────────────────────────┐    │
│ │ Adjusts node count to fit Pod scheduling     │    │
│ │ ├─ Adds nodes when Pods are Pending          │    │
│ │ ├─ Removes underutilized nodes               │    │
│ │ ├─ Supports multiple node groups             │    │
│ │ └─ Honors PDBs and Pod priorities            │    │
│ └──────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────┘
Scaling Flow
User traffic increases ────────────────────┐
                                           ↓
                                    Load increases
                                           ↓
┌─────────────────────────────┐
│ HPA detects a metric above  │
│ its threshold (CPU > 70% or │
│ a custom metric)            │
└──────────┬──────────────────┘
           ↓
┌─────────────────────────────┐
│ HPA raises the replica      │
│ count (e.g. 3 → 6)          │
└──────────┬──────────────────┘
           ↓
┌─────────────────────────────┐
│ Some Pods stay Pending      │
│ (not enough node capacity)  │
└──────────┬──────────────────┘
           ↓
┌─────────────────────────────┐
│ Cluster Autoscaler notices  │
│ the Pending Pods            │
└──────────┬──────────────────┘
           ↓
┌─────────────────────────────┐
│ Cluster Autoscaler adds     │
│ nodes (scales the AWS ASG)  │
└──────────┬──────────────────┘
           ↓
┌─────────────────────────────┐
│ New nodes join the cluster; │
│ Pending Pods are scheduled  │
└──────────┬──────────────────┘
           ↓
   Load is handled
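The HPA step in the flow above follows the standard replica formula desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A minimal shell sketch with illustrative numbers:

```shell
#!/bin/bash
# HPA replica math: desired = ceil(current * metric / target)
current_replicas=3
current_cpu=140   # observed average CPU utilization (%), illustrative
target_cpu=70     # HPA target utilization (%)

# integer ceiling division: (a + b - 1) / b
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "$desired"
```

At twice the target load, the replica count doubles from 3 to 6, which is exactly the 3 → 6 step shown in the diagram.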
Horizontal Pod Autoscaler (HPA)
CPU-Based HPA
# user-service-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    # CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  # Scale-up/scale-down behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 2
          periodSeconds: 60
      selectPolicy: Min
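The scaleUp block above allows, per 60-second period, either doubling the replica count (Percent: 100) or adding 4 Pods, and selectPolicy: Max picks whichever change is larger. Starting from 3 replicas, the first step can therefore reach 7; a quick sketch:

```shell
#!/bin/bash
# Effect of scaleUp policies Percent=100 and Pods=4 with selectPolicy: Max
replicas=3
percent_limit=$(( replicas + replicas * 100 / 100 ))  # +100% of current replicas
pods_limit=$(( replicas + 4 ))                        # +4 Pods
# selectPolicy: Max means the less restrictive (larger) limit wins
next=$(( percent_limit > pods_limit ? percent_limit : pods_limit ))
echo "$next"
```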
Custom-Metrics HPA
Install the Prometheus Adapter:
#!/bin/bash
# install-prometheus-adapter.sh
NAMESPACE="monitoring"
echo "Installing Prometheus Adapter..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
cat > prometheus-adapter-values.yaml << 'EOF'
prometheus:
  url: http://prometheus-stack-kube-prom-prometheus.monitoring.svc
  port: 9090
rules:
  default: false
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
    - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_bucket"
        as: "${1}"
      metricsQuery: 'histogram_quantile(0.95, sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (le, <<.GroupBy>>))'
EOF
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  -n $NAMESPACE \
  -f prometheus-adapter-values.yaml
echo "✓ Prometheus Adapter installed"
An HPA driven by custom metrics:
# custom-metrics-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 3
  maxReplicas: 30
  metrics:
    # CPU baseline
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Custom metric: requests per second
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
    # Custom metric: P95 response time
    - type: Pods
      pods:
        metric:
          name: http_request_duration_seconds
        target:
          type: AverageValue
          averageValue: "500m"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
      selectPolicy: Min
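The deliberately conservative scaleDown above (a 600s stabilization window, then at most 1 Pod per 60s) bounds how fast this HPA can drain. Going from maxReplicas back down to minReplicas takes roughly:

```shell
#!/bin/bash
# Rough worst-case scale-in time for the HPA above
max_replicas=30
min_replicas=3
stabilization=600  # seconds before the first scale-in decision
period=60          # seconds per removed Pod (Pods=1 per 60s)
total=$(( stabilization + (max_replicas - min_replicas) * period ))
echo "$(( total / 60 )) minutes"
```

About 37 minutes in the worst case, which is why scaleDown is typically much slower than scaleUp: flapping down too quickly is riskier than holding spare capacity.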
Event-Driven Scaling with KEDA
Install KEDA:
#!/bin/bash
# install-keda.sh
echo "Installing KEDA..."
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
echo "✓ KEDA installed"
Scale on SQS queue length:
# keda-sqs-scaler.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: production
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE"
  AWS_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
---
# The ScaledObject's authenticationRef must point at a TriggerAuthentication,
# not at the Secret directly
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials-auth
  namespace: production
spec:
  secretTargetRef:
    - parameter: awsAccessKeyID
      name: aws-credentials
      key: AWS_ACCESS_KEY_ID
    - parameter: awsSecretAccessKey
      name: aws-credentials
      key: AWS_SECRET_ACCESS_KEY
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 50
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/order-queue
        queueLength: "10"
        awsRegion: "us-east-1"
      authenticationRef:
        name: aws-credentials-auth
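KEDA's aws-sqs-queue trigger targets queueLength visible messages per replica, so the desired replica count is essentially ceil(backlog / queueLength), clamped to min/maxReplicaCount. A sketch using the limits above (the backlog value is illustrative):

```shell
#!/bin/bash
# KEDA SQS scaling sketch: desired = ceil(backlog / queueLength), clamped
backlog=137        # visible messages in the queue (illustrative)
per_replica=10     # queueLength from the ScaledObject
min_replicas=2
max_replicas=50

# ceiling division, then clamp into [min_replicas, max_replicas]
desired=$(( (backlog + per_replica - 1) / per_replica ))
if [ "$desired" -lt "$min_replicas" ]; then desired=$min_replicas; fi
if [ "$desired" -gt "$max_replicas" ]; then desired=$max_replicas; fi
echo "$desired"
```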
Scale on a CloudWatch metric:
# keda-cloudwatch-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: report-generator-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: report-generator
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: aws-cloudwatch
      metadata:
        namespace: AWS/SQS
        dimensionName: QueueName
        dimensionValue: report-queue
        metricName: ApproximateNumberOfMessagesVisible
        targetMetricValue: "5"
        minMetricValue: "0"
        awsRegion: "us-east-1"
      # Note: this trigger also needs AWS credentials, supplied via an
      # authenticationRef (TriggerAuthentication) or pod-level identity
Vertical Pod Autoscaler (VPA)
Install VPA
#!/bin/bash
# install-vpa.sh
echo "Installing the Vertical Pod Autoscaler..."
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
echo "✓ VPA installed"
# Verify
kubectl get pods -n kube-system | grep vpa
VPA Configuration Examples
Recommendation mode (suggestions only):
# user-service-vpa-recommend.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: user-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  updatePolicy:
    updateMode: "Off"  # recommend only, never update Pods
  resourcePolicy:
    containerPolicies:
      - containerName: user-service
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2000m
          memory: 2Gi
        controlledResources:
          - cpu
          - memory
Auto mode (applies updates):
# product-service-vpa-auto.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: product-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: product-service
  updatePolicy:
    updateMode: "Auto"  # restarts Pods automatically to apply new resources
  resourcePolicy:
    containerPolicies:
      - containerName: product-service
        minAllowed:
          cpu: 200m
          memory: 256Mi
        maxAllowed:
          cpu: 4000m
          memory: 4Gi
        controlledResources:
          - cpu
          - memory
        controlledValues: RequestsAndLimits
View VPA recommendations:
#!/bin/bash
# check-vpa-recommendations.sh
NAMESPACE="production"
echo "VPA recommendations:"
echo ""
kubectl get vpa -n $NAMESPACE -o custom-columns=\
NAME:.metadata.name,\
CPU_REQUEST:.status.recommendation.containerRecommendations[0].target.cpu,\
MEMORY_REQUEST:.status.recommendation.containerRecommendations[0].target.memory,\
UPDATE_MODE:.spec.updatePolicy.updateMode
Cluster Autoscaler
Install the Cluster Autoscaler
#!/bin/bash
# install-cluster-autoscaler.sh
CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "================================================"
echo "Installing the Cluster Autoscaler"
echo "================================================"
# 1. Create the IAM policy
echo ""
echo "1. Creating the IAM policy..."
cat > cluster-autoscaler-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeImages",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
      ],
      "Resource": ["*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": ["*"]
    }
  ]
}
EOF
aws iam create-policy \
  --policy-name AmazonEKSClusterAutoscalerPolicy \
  --policy-document file://cluster-autoscaler-policy.json
POLICY_ARN="arn:aws:iam::${ACCOUNT_ID}:policy/AmazonEKSClusterAutoscalerPolicy"
# 2. Create the IRSA service account
echo ""
echo "2. Creating the Service Account..."
eksctl create iamserviceaccount \
  --cluster=$CLUSTER_NAME \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=$POLICY_ARN \
  --override-existing-serviceaccounts \
  --approve \
  --region=$REGION
# 3. Deploy the Cluster Autoscaler
echo ""
echo "3. Deploying the Cluster Autoscaler..."
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${CLUSTER_NAME}
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
          resources:
            requests:
              cpu: 100m
              memory: 300Mi
            limits:
              cpu: 200m
              memory: 600Mi
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt
              readOnly: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: /etc/ssl/certs/ca-bundle.crt
EOF
echo ""
echo "4. Waiting for the Cluster Autoscaler to become ready..."
kubectl rollout status deployment/cluster-autoscaler -n kube-system
echo ""
echo "================================================"
echo "Cluster Autoscaler installed!"
echo "================================================"
echo ""
echo "View logs with:"
echo "  kubectl logs -f deployment/cluster-autoscaler -n kube-system"
echo "================================================"
rm -f cluster-autoscaler-policy.json
Cluster Autoscaler Tuning
The priority expander below ranks node groups by regex. Note that this ConfigMap only takes effect when the autoscaler is started with --expander=priority; the deployment above uses --expander=least-waste.
# cluster-autoscaler-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    10:
      - .*-spot-.*
    50:
      - .*-general-.*
    100:
      - .*-compute-.*
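With the priority expander, the candidate node group matching the regex with the highest priority value wins, so here compute groups are preferred over general-purpose ones, with spot last. The selection logic, sketched in shell with hypothetical group names:

```shell
#!/bin/bash
# Priority-expander sketch: highest-priority regex that matches a candidate wins
candidates="prod-spot-nodes prod-general-nodes prod-compute-nodes"  # hypothetical
selected=""
for re in '.*-compute-.*' '.*-general-.*' '.*-spot-.*'; do  # priorities 100, 50, 10
  for g in $candidates; do
    if echo "$g" | grep -Eq "$re"; then
      selected=$g
      break 2
    fi
  done
done
echo "$selected"
```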
Disaster Recovery
Backup Strategy Architecture
┌─────────────────────────────────────────────────────┐
│ Backup tiers                                        │
│                                                     │
│ 1. Kubernetes resources                             │
│    ├─ Namespace                                     │
│    ├─ ConfigMap / Secret                            │
│    ├─ Deployment / StatefulSet / DaemonSet          │
│    ├─ Service / Ingress                             │
│    └─ PVC / PV                                      │
│                                                     │
│ 2. Persistent data                                  │
│    ├─ EBS Volumes (PV)                              │
│    ├─ RDS Snapshots                                 │
│    ├─ Redis Backups                                 │
│    └─ S3 Objects                                    │
│                                                     │
│ 3. Application data                                 │
│    ├─ Database Dumps                                │
│    ├─ Configuration Files                           │
│    └─ Application State                             │
│                                                     │
└─────────────────┬───────────────────────────────────┘
                  ↓
         Velero + AWS Backup
                  ↓
  S3 Bucket (cross-region replication)
Install Velero
#!/bin/bash
# install-velero.sh
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET="production-velero-backups-${ACCOUNT_ID}"
CLUSTER_NAME="production-eks-cluster"
echo "================================================"
echo "Installing Velero"
echo "================================================"
# 1. Create the S3 bucket
echo ""
echo "1. Creating the Velero S3 bucket..."
aws s3api create-bucket \
  --bucket $BUCKET \
  --region $REGION
aws s3api put-bucket-versioning \
  --bucket $BUCKET \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-encryption \
  --bucket $BUCKET \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }]
  }'
# 2. Create the IAM policy
echo ""
echo "2. Creating the IAM policy..."
cat > velero-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::${BUCKET}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${BUCKET}"
      ]
    }
  ]
}
EOF
aws iam create-policy \
  --policy-name VeleroBackupPolicy \
  --policy-document file://velero-policy.json
POLICY_ARN="arn:aws:iam::${ACCOUNT_ID}:policy/VeleroBackupPolicy"
# 3. Create the IRSA service account
echo ""
echo "3. Creating the Service Account..."
eksctl create iamserviceaccount \
  --cluster=$CLUSTER_NAME \
  --namespace=velero \
  --name=velero \
  --attach-policy-arn=$POLICY_ARN \
  --approve \
  --region=$REGION
# 4. Install the Velero CLI
echo ""
echo "4. Downloading the Velero CLI..."
VELERO_VERSION="v1.12.0"
wget https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz
tar -xvf velero-${VELERO_VERSION}-linux-amd64.tar.gz
sudo mv velero-${VELERO_VERSION}-linux-amd64/velero /usr/local/bin/
rm -rf velero-${VELERO_VERSION}-linux-amd64*
# 5. Install Velero into the cluster
echo ""
echo "5. Installing Velero into the cluster..."
# NOTE: the IAM role name below is the one eksctl generates for the service
# account; confirm it with: eksctl get iamserviceaccount --cluster $CLUSTER_NAME
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket $BUCKET \
  --backup-location-config region=$REGION \
  --snapshot-location-config region=$REGION \
  --sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/eksctl-${CLUSTER_NAME}-addon-iamserviceaccount-Role1 \
  --no-secret \
  --use-node-agent
echo ""
echo "6. Waiting for Velero to become ready..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=velero \
  -n velero \
  --timeout=300s
echo ""
echo "================================================"
echo "Velero installed!"
echo "================================================"
echo ""
echo "Backup Bucket: $BUCKET"
echo ""
echo "Verify the installation:"
echo "  velero version"
echo "  velero backup-location get"
echo "================================================"
rm -f velero-policy.json
Configure Backup Schedules
Daily full backup:
# daily-backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 02:00 every day
  template:
    includedNamespaces:
      - production
      - staging
    includedResources:
      - '*'
    excludedResources:
      - events
      - events.events.k8s.io
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h  # keep for 30 days
    snapshotVolumes: true
    hooks:
      resources:
        - name: postgres-backup
          includedNamespaces:
            - production
          labelSelector:
            matchLabels:
              app: postgres
          pre:
            - exec:
                container: postgres
                command:
                  - /bin/bash
                  - -c
                  - pg_dump -U postgres mydb > /tmp/backup.sql
                onError: Continue
                timeout: 10m
Hourly incremental backup (configuration only):
# hourly-config-backup.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-config-backup
  namespace: velero
spec:
  schedule: "0 * * * *"  # every hour on the hour
  template:
    includedNamespaces:
      - production
    includedResources:
      - configmaps
      - secrets
      - deployments
      - statefulsets
      - services
      - ingresses
    storageLocation: default
    ttl: 168h  # keep for 7 days
    snapshotVolumes: false
Backup Operations
Create a backup manually:
#!/bin/bash
# create-manual-backup.sh
BACKUP_NAME="manual-backup-$(date +%Y%m%d-%H%M%S)"
echo "Creating manual backup: $BACKUP_NAME"
velero backup create $BACKUP_NAME \
  --include-namespaces production \
  --snapshot-volumes \
  --wait
echo ""
echo "Backup details:"
velero backup describe $BACKUP_NAME
echo ""
echo "Backup logs:"
velero backup logs $BACKUP_NAME
List all backups:
#!/bin/bash
# list-backups.sh
echo "All backups:"
velero backup get
echo ""
echo "Backup locations:"
velero backup-location get
echo ""
echo "Snapshot locations:"
velero snapshot-location get
Restore Operations
Full cluster restore:
#!/bin/bash
# restore-full-cluster.sh
BACKUP_NAME="daily-full-backup-20240111020000"
echo "================================================"
echo "Running a full cluster restore"
echo "Backup: $BACKUP_NAME"
echo "================================================"
# 1. Verify the backup
echo ""
echo "1. Verifying the backup..."
velero backup describe $BACKUP_NAME
read -p "Proceed with the restore? (yes/no) " -r
if [[ "$REPLY" != "yes" ]]; then
  echo "Restore cancelled"
  exit 0
fi
# 2. Run the restore
echo ""
echo "2. Running the restore..."
RESTORE_NAME="restore-$(date +%Y%m%d-%H%M%S)"
velero restore create $RESTORE_NAME \
  --from-backup $BACKUP_NAME \
  --wait
# 3. Verify the restore
echo ""
echo "3. Verifying the restore..."
velero restore describe $RESTORE_NAME
echo ""
echo "4. Checking Pod status..."
kubectl get pods --all-namespaces
echo ""
echo "================================================"
echo "Restore complete!"
echo "================================================"
Selective restore (a single namespace):
#!/bin/bash
# restore-namespace.sh
BACKUP_NAME="daily-full-backup-20240111020000"
NAMESPACE="production"
echo "Restoring namespace: $NAMESPACE"
velero restore create restore-$NAMESPACE-$(date +%Y%m%d-%H%M%S) \
  --from-backup $BACKUP_NAME \
  --include-namespaces $NAMESPACE \
  --wait
echo "✓ Restore complete"
Disaster Recovery Drills
Periodic drill script:
#!/bin/bash
# disaster-recovery-drill.sh
echo "================================================"
echo "Disaster recovery drill"
echo "================================================"
# 1. Create a test backup
echo ""
echo "1. Creating a test backup..."
BACKUP_NAME="dr-drill-$(date +%Y%m%d-%H%M%S)"
velero backup create $BACKUP_NAME \
  --include-namespaces production \
  --wait
# 2. Restore into a test namespace
echo ""
echo "2. Restoring into the test namespace..."
velero restore create restore-drill-$(date +%Y%m%d-%H%M%S) \
  --from-backup $BACKUP_NAME \
  --namespace-mappings production:dr-test \
  --wait
# 3. Verify the application
echo ""
echo "3. Verifying the application..."
kubectl get pods -n dr-test
# 4. Run functional checks
echo ""
echo "4. Running functional checks..."
kubectl run test-pod -n dr-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl http://user-service:9001/health
# 5. Clean up
echo ""
echo "5. Cleaning up test resources..."
kubectl delete namespace dr-test
velero backup delete $BACKUP_NAME --confirm
echo ""
echo "================================================"
echo "Drill complete!"
echo "================================================"
Best-Practice Summary
1. HPA configuration
✓ Combine multiple metrics
✓ Set sensible min/max replica counts
✓ Configure scaling behavior (stabilization windows)
✓ Use custom metrics where load is not CPU-bound
✓ Do not let HPA and VPA act on the same resource dimension
2. VPA configuration
✓ Start in recommendation mode
✓ Set sensible min/max bounds
✓ Use Auto mode for stateless workloads
✓ Be cautious with stateful workloads
✓ Monitor the accuracy of VPA recommendations
3. Cluster Autoscaler
✓ Configure multiple node groups
✓ Use node labels and taints
✓ Protect critical Pods with PDBs
✓ Set sensible scale-up/scale-down thresholds
✓ Monitor scale-up latency
4. Backup strategy
✓ Automated periodic backups (daily)
✓ Layered backups (full + incremental)
✓ Cross-region replication for critical backups
✓ Regularly test the restore procedure
✓ Document the recovery steps
✓ Run disaster drills on a schedule
5. Disaster recovery
✓ Define explicit RTO and RPO targets
✓ Deploy across multiple availability zones
✓ Verify backups regularly
✓ Automate the recovery workflow
✓ Retain multiple backup versions
✓ Monitor backup job status
Next: continue with the Cost Optimization and Best Practices chapter.
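The backup schedules earlier in this chapter translate directly into worst-case RPO bounds: a daily full backup means up to roughly 24 hours of volume data could be lost, while the hourly configuration backup bounds configuration loss to about 1 hour (a rough approximation that ignores backup duration):

```shell
#!/bin/bash
# Rough worst-case RPO implied by the backup schedules in this chapter
full_interval_h=24    # daily full backup (volumes + resources)
config_interval_h=1   # hourly config-only backup
echo "volume RPO <= ${full_interval_h}h, config RPO <= ${config_interval_h}h"
```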