Autoscaling and Disaster Recovery

This chapter covers autoscaling strategies and disaster recovery for Kubernetes clusters in detail, including HPA, VPA, Cluster Autoscaler, KEDA, and Velero backup/restore.

Autoscaling Architecture

Three-Tier Scaling Strategy

┌─────────────────────────────────────────────────────┐
│    Application tier (Horizontal Pod Autoscaler)     │
│  ┌──────────────────────────────────────────────┐  │
│  │  Adjusts Pod replica count based on metrics  │  │
│  │  ├─ CPU utilization                          │  │
│  │  ├─ Memory utilization                       │  │
│  │  ├─ Custom metrics (QPS, queue length, ...)  │  │
│  │  └─ External metrics (SQS, CloudWatch)       │  │
│  └──────────────────────────────────────────────┘  │
└───────────────────┬─────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│    Pod resource tuning (Vertical Pod Autoscaler)    │
│  ┌──────────────────────────────────────────────┐  │
│  │  Adjusts Pod requests and limits             │  │
│  │  ├─ Analyzes historical resource usage       │  │
│  │  ├─ Recommends right-sized resources         │  │
│  │  └─ Applies updates automatically (optional) │  │
│  └──────────────────────────────────────────────┘  │
└───────────────────┬─────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│           Node tier (Cluster Autoscaler)            │
│  ┌──────────────────────────────────────────────┐  │
│  │  Matches node count to Pod scheduling demand │  │
│  │  ├─ Adds nodes when Pods are Pending         │  │
│  │  ├─ Removes underutilized nodes              │  │
│  │  ├─ Supports multiple node groups            │  │
│  │  └─ Respects PDBs and Pod priorities         │  │
│  └──────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Scaling Flow

User requests increase ───────────────────────────┐
                                          ↓
                                   Load rises
                                          ↓
                        ┌─────────────────────────────┐
                        │ HPA detects threshold breach│
                        │ (CPU > 70% or custom metric)│
                        └──────────┬──────────────────┘
                                   ↓
                        ┌─────────────────────────────┐
                        │  HPA increases Pod replicas │
                        │  (e.g. 3 → 6)               │
                        └──────────┬──────────────────┘
                                   ↓
                        ┌─────────────────────────────┐
                        │  Some Pods remain Pending   │
                        │  (not enough node capacity) │
                        └──────────┬──────────────────┘
                                   ↓
                        ┌─────────────────────────────┐
                        │  Cluster Autoscaler detects │
                        │  Pending Pods               │
                        └──────────┬──────────────────┘
                                   ↓
                        ┌─────────────────────────────┐
                        │  Cluster Autoscaler adds    │
                        │  nodes (scales the AWS ASG) │
                        └──────────┬──────────────────┘
                                   ↓
                        ┌─────────────────────────────┐
                        │  New nodes join the cluster │
                        │  Pending Pods are scheduled │
                        └──────────┬──────────────────┘
                                   ↓
                             Load is handled

Horizontal Pod Autoscaler (HPA)

CPU-Based HPA

# user-service-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  
  minReplicas: 3
  maxReplicas: 20
  
  metrics:
  # CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  
  # Memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  
  # Scaling behavior
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 4
        periodSeconds: 60
      selectPolicy: Max
    
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      - type: Pods
        value: 2
        periodSeconds: 60
      selectPolicy: Min
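Under selectPolicy: Max, the scaleUp block above allows whichever change is larger in each 60-second window: doubling (+100%) or adding 4 Pods. A minimal sketch of how those two policies combine (hypothetical helper, plain shell arithmetic):

```shell
# Effective scale-up ceiling per 60s window under selectPolicy: Max:
# the more permissive of "+100% of current replicas" and "+4 Pods".
scale_up_limit() {
  local current=$1
  local by_percent=$(( current * 2 ))  # Percent policy: +100% => double
  local by_pods=$(( current + 4 ))     # Pods policy: +4 replicas
  if [ "$by_percent" -gt "$by_pods" ]; then
    echo "$by_percent"
  else
    echo "$by_pods"
  fi
}

scale_up_limit 3    # prints 7  (+4 Pods beats doubling at small counts)
scale_up_limit 10   # prints 20 (doubling beats +4 Pods at larger counts)
```

With selectPolicy: Min (as in the scaleDown block) the smaller allowed change wins instead, which is why scale-down proceeds more conservatively.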

HPA with Custom Metrics

Install the Prometheus Adapter:

#!/bin/bash
# install-prometheus-adapter.sh

NAMESPACE="monitoring"

echo "Installing Prometheus Adapter..."

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

cat > prometheus-adapter-values.yaml << 'EOF'
prometheus:
  url: http://prometheus-stack-kube-prom-prometheus.monitoring.svc
  port: 9090

rules:
  default: false
  custom:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
  
  - seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_bucket"
      as: "${1}"
    metricsQuery: 'histogram_quantile(0.95, sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (le, <<.GroupBy>>))'
EOF

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  -n $NAMESPACE \
  -f prometheus-adapter-values.yaml

echo "✓ Prometheus Adapter installed"

An HPA driven by the custom metrics:

# custom-metrics-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  
  minReplicas: 3
  maxReplicas: 30
  
  metrics:
  # CPU baseline
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  
  # Custom metric: requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  
  # Custom metric: P95 latency (500m = 0.5s)
  - type: Pods
    pods:
      metric:
        name: http_request_duration_seconds
      target:
        type: AverageValue
        averageValue: "500m"
  
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      selectPolicy: Max
    
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
      selectPolicy: Min
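For each metric, the HPA controller computes desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), takes the largest result across all metrics, and clamps it to minReplicas/maxReplicas (stabilization windows and tolerance aside). A small illustration of that formula via integer ceiling division (hypothetical helper):

```shell
# ceil(current * metric / target) using integer arithmetic
desired_replicas() {
  local current=$1 metric=$2 target=$3
  echo $(( (current * metric + target - 1) / target ))
}

# 3 pods averaging 1500 req/s against the 1000 req/s target above:
desired_replicas 3 1500 1000   # prints 5
# Below target, the count shrinks (subject to scaleDown behavior):
desired_replicas 3 700 1000    # prints 3, i.e. ceil(2.1)
```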

Event-Driven Scaling with KEDA

Install KEDA:

#!/bin/bash
# install-keda.sh

echo "Installing KEDA..."

helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda --namespace keda --create-namespace

echo "✓ KEDA installed"

Scaling on SQS queue length:

# keda-sqs-scaler.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: production
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE"
  AWS_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

---
# KEDA's authenticationRef must point at a TriggerAuthentication,
# not directly at the Secret
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials-auth
  namespace: production
spec:
  secretTargetRef:
  - parameter: awsAccessKeyID
    name: aws-credentials
    key: AWS_ACCESS_KEY_ID
  - parameter: awsSecretAccessKey
    name: aws-credentials
    key: AWS_SECRET_ACCESS_KEY

---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  
  minReplicaCount: 2
  maxReplicaCount: 50
  
  pollingInterval: 30
  cooldownPeriod: 300
  
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/order-queue
      queueLength: "10"
      awsRegion: "us-east-1"
    authenticationRef:
      name: aws-credentials-auth
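With queueLength: "10", KEDA exposes the SQS queue depth to the HPA as an external metric targeting roughly 10 visible messages per replica, so the replica count tracks approximately ceil(visibleMessages / 10), clamped to minReplicaCount/maxReplicaCount. A rough sketch of that arithmetic (hypothetical helper, not KEDA's actual code):

```shell
keda_sqs_replicas() {
  local messages=$1 per_replica=$2 min=$3 max=$4
  # ceil(messages / per_replica), then clamp to [min, max]
  local desired=$(( (messages + per_replica - 1) / per_replica ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

keda_sqs_replicas 120 10 2 50   # prints 12
keda_sqs_replicas 0 10 2 50     # prints 2  (floor at minReplicaCount)
keda_sqs_replicas 900 10 2 50   # prints 50 (capped at maxReplicaCount)
```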

Scaling on a CloudWatch metric:

# keda-cloudwatch-scaler.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: report-generator-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: report-generator
  
  minReplicaCount: 1
  maxReplicaCount: 20
  
  triggers:
  - type: aws-cloudwatch
    metadata:
      namespace: AWS/SQS
      dimensionName: QueueName
      dimensionValue: report-queue
      metricName: ApproximateNumberOfMessagesVisible
      targetMetricValue: "5"
      minMetricValue: "0"
      awsRegion: "us-east-1"

Vertical Pod Autoscaler (VPA)

Installing VPA

#!/bin/bash
# install-vpa.sh

echo "Installing the Vertical Pod Autoscaler..."

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler

./hack/vpa-up.sh

echo "✓ VPA installed"

# Verify
kubectl get pods -n kube-system | grep vpa

VPA Configuration Examples

Recommendation mode (suggestions only):

# user-service-vpa-recommend.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: user-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  
  updatePolicy:
    updateMode: "Off"  # recommend only, do not apply
  
  resourcePolicy:
    containerPolicies:
    - containerName: user-service
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2000m
        memory: 2Gi
      controlledResources:
      - cpu
      - memory

Auto mode (applies updates):

# product-service-vpa-auto.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: product-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: product-service
  
  updatePolicy:
    updateMode: "Auto"  # restarts Pods to apply the updated resources
  
  resourcePolicy:
    containerPolicies:
    - containerName: product-service
      minAllowed:
        cpu: 200m
        memory: 256Mi
      maxAllowed:
        cpu: 4000m
        memory: 4Gi
      controlledResources:
      - cpu
      - memory
      controlledValues: RequestsAndLimits

Check VPA recommendations:

#!/bin/bash
# check-vpa-recommendations.sh

NAMESPACE="production"

echo "VPA recommendations:"
echo ""

kubectl get vpa -n $NAMESPACE -o custom-columns=\
NAME:.metadata.name,\
CPU_REQUEST:.status.recommendation.containerRecommendations[0].target.cpu,\
MEMORY_REQUEST:.status.recommendation.containerRecommendations[0].target.memory,\
UPDATE_MODE:.spec.updatePolicy.updateMode

Cluster Autoscaler

Installing Cluster Autoscaler

#!/bin/bash
# install-cluster-autoscaler.sh

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

echo "================================================"
echo "Installing Cluster Autoscaler"
echo "================================================"

# 1. Create the IAM policy
echo ""
echo "1. Creating IAM policy..."
cat > cluster-autoscaler-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeImages",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
      ],
      "Resource": ["*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": ["*"]
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name AmazonEKSClusterAutoscalerPolicy \
  --policy-document file://cluster-autoscaler-policy.json

POLICY_ARN="arn:aws:iam::${ACCOUNT_ID}:policy/AmazonEKSClusterAutoscalerPolicy"

# 2. Create the IRSA
echo ""
echo "2. Creating the Service Account..."
eksctl create iamserviceaccount \
  --cluster=$CLUSTER_NAME \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=$POLICY_ARN \
  --override-existing-serviceaccounts \
  --approve \
  --region=$REGION

# 3. Deploy Cluster Autoscaler
echo ""
echo "3. Deploying Cluster Autoscaler..."
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${CLUSTER_NAME}
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false
        resources:
          requests:
            cpu: 100m
            memory: 300Mi
          limits:
            cpu: 200m
            memory: 600Mi
        volumeMounts:
        - name: ssl-certs
          mountPath: /etc/ssl/certs/ca-certificates.crt
          readOnly: true
      volumes:
      - name: ssl-certs
        hostPath:
          path: /etc/ssl/certs/ca-bundle.crt
EOF

echo ""
echo "4. Waiting for Cluster Autoscaler to become ready..."
kubectl rollout status deployment/cluster-autoscaler -n kube-system

echo ""
echo "================================================"
echo "Cluster Autoscaler installed!"
echo "================================================"
echo ""
echo "View the logs:"
echo "  kubectl logs -f deployment/cluster-autoscaler -n kube-system"
echo "================================================"

rm -f cluster-autoscaler-policy.json

Cluster Autoscaler Tuning

Note that the priority expander below takes effect only when Cluster Autoscaler is started with --expander=priority (the install script above uses least-waste):

# cluster-autoscaler-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    10:
      - .*-spot-.*
    50:
      - .*-general-.*
    100:
      - .*-compute-.*
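With the priority expander enabled, Cluster Autoscaler picks, among the node groups that could satisfy the pending Pods, one whose name matches the pattern with the highest priority value; in the ConfigMap above, compute groups (100) win over general (50) and spot (10). A hypothetical sketch of that selection (illustrative names, glob matching in place of the expander's regexes):

```shell
# Pick the candidate node group with the highest configured priority.
pick_node_group() {
  local best="" best_prio=-1
  for group in "$@"; do
    local prio=-1
    case "$group" in
      *-compute-*) prio=100 ;;  # pattern .*-compute-.*
      *-general-*) prio=50 ;;   # pattern .*-general-.*
      *-spot-*)    prio=10 ;;   # pattern .*-spot-.*
    esac
    if [ "$prio" -gt "$best_prio" ]; then
      best=$group
      best_prio=$prio
    fi
  done
  echo "$best"
}

pick_node_group eks-spot-a eks-general-b eks-compute-c   # prints eks-compute-c
```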

Disaster Recovery

Backup Architecture

┌─────────────────────────────────────────────────────┐
│                    Backup tiers                     │
│                                                     │
│  1. Kubernetes resources                            │
│     ├─ Namespace                                    │
│     ├─ ConfigMap / Secret                           │
│     ├─ Deployment / StatefulSet / DaemonSet         │
│     ├─ Service / Ingress                            │
│     └─ PVC / PV                                     │
│                                                     │
│  2. Persistent data                                 │
│     ├─ EBS Volumes (PV)                             │
│     ├─ RDS Snapshots                                │
│     ├─ Redis Backups                                │
│     └─ S3 Objects                                   │
│                                                     │
│  3. Application data                                │
│     ├─ Database Dumps                               │
│     ├─ Configuration Files                          │
│     └─ Application State                            │
│                                                     │
└─────────────────┬───────────────────────────────────┘
                  ↓
         Velero + AWS Backup
                  ↓
         S3 Bucket (cross-region replication)

Installing Velero

#!/bin/bash
# install-velero.sh

REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET="production-velero-backups-${ACCOUNT_ID}"
CLUSTER_NAME="production-eks-cluster"

echo "================================================"
echo "Installing Velero"
echo "================================================"

# 1. Create the S3 bucket
echo ""
echo "1. Creating Velero S3 bucket..."
aws s3api create-bucket \
  --bucket $BUCKET \
  --region $REGION

aws s3api put-bucket-versioning \
  --bucket $BUCKET \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption \
  --bucket $BUCKET \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }]
  }'

# 2. Create the IAM policy
echo ""
echo "2. Creating IAM policy..."
cat > velero-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::${BUCKET}/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::${BUCKET}"
      ]
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name VeleroBackupPolicy \
  --policy-document file://velero-policy.json

POLICY_ARN="arn:aws:iam::${ACCOUNT_ID}:policy/VeleroBackupPolicy"

# 3. Create the IRSA
echo ""
echo "3. Creating the Service Account..."
eksctl create iamserviceaccount \
  --cluster=$CLUSTER_NAME \
  --namespace=velero \
  --name=velero \
  --attach-policy-arn=$POLICY_ARN \
  --approve \
  --region=$REGION

# 4. Install the Velero CLI
echo ""
echo "4. Downloading the Velero CLI..."
VELERO_VERSION="v1.12.0"
wget https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz
tar -xvf velero-${VELERO_VERSION}-linux-amd64.tar.gz
sudo mv velero-${VELERO_VERSION}-linux-amd64/velero /usr/local/bin/
rm -rf velero-${VELERO_VERSION}-linux-amd64*

# 5. Install Velero into the cluster
echo ""
echo "5. Installing Velero into the cluster..."
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket $BUCKET \
  --backup-location-config region=$REGION \
  --snapshot-location-config region=$REGION \
  --sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/eksctl-${CLUSTER_NAME}-addon-iamserviceaccount-Role1 \
  --no-secret \
  --use-node-agent

echo ""
echo "6. Waiting for Velero to become ready..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=velero \
  -n velero \
  --timeout=300s

echo ""
echo "================================================"
echo "Velero installed!"
echo "================================================"
echo ""
echo "Backup Bucket: $BUCKET"
echo ""
echo "Verify the installation:"
echo "  velero version"
echo "  velero backup-location get"
echo "================================================"

rm -f velero-policy.json

Configuring Backup Schedules

Daily full backup:

# daily-backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2:00 AM daily
  template:
    includedNamespaces:
    - production
    - staging
    
    includedResources:
    - '*'
    
    excludedResources:
    - events
    - events.events.k8s.io
    
    storageLocation: default
    
    volumeSnapshotLocations:
    - default
    
    ttl: 720h  # 30-day retention
    
    snapshotVolumes: true
    
    hooks:
      resources:
      - name: postgres-backup
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: postgres
        pre:
        - exec:
            container: postgres
            command:
            - /bin/bash
            - -c
            - pg_dump -U postgres mydb > /tmp/backup.sql
            onError: Continue
            timeout: 10m

Hourly incremental backup (configuration only):

# hourly-config-backup.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-config-backup
  namespace: velero
spec:
  schedule: "0 * * * *"  # every hour
  template:
    includedNamespaces:
    - production
    
    includedResources:
    - configmaps
    - secrets
    - deployments
    - statefulsets
    - services
    - ingresses
    
    storageLocation: default
    
    ttl: 168h  # 7-day retention
    
    snapshotVolumes: false

Backup Operations

Create a manual backup:

#!/bin/bash
# create-manual-backup.sh

BACKUP_NAME="manual-backup-$(date +%Y%m%d-%H%M%S)"

echo "Creating manual backup: $BACKUP_NAME"

velero backup create $BACKUP_NAME \
  --include-namespaces production \
  --snapshot-volumes \
  --wait

echo ""
echo "Backup details:"
velero backup describe $BACKUP_NAME

echo ""
echo "Backup logs:"
velero backup logs $BACKUP_NAME

List all backups:

#!/bin/bash
# list-backups.sh

echo "All backups:"
velero backup get

echo ""
echo "Backup locations:"
velero backup-location get

echo ""
echo "Snapshot locations:"
velero snapshot-location get

Restore Operations

Full cluster restore:

#!/bin/bash
# restore-full-cluster.sh

BACKUP_NAME="daily-full-backup-20240111020000"

echo "================================================"
echo "Running a full cluster restore"
echo "Backup: $BACKUP_NAME"
echo "================================================"

# 1. Verify the backup
echo ""
echo "1. Verifying backup..."
velero backup describe $BACKUP_NAME

read -p "Confirm restore? (yes/no) " -r
if [[ ! $REPLY == "yes" ]]; then
  echo "Restore cancelled"
  exit 0
fi

# 2. Run the restore
echo ""
echo "2. Running restore..."
RESTORE_NAME="restore-$(date +%Y%m%d-%H%M%S)"

velero restore create $RESTORE_NAME \
  --from-backup $BACKUP_NAME \
  --wait

# 3. Verify the restore
echo ""
echo "3. Verifying restore..."
velero restore describe $RESTORE_NAME

echo ""
echo "4. Checking Pod status..."
kubectl get pods --all-namespaces

echo ""
echo "================================================"
echo "Restore complete!"
echo "================================================"

Selective restore (a single namespace):

#!/bin/bash
# restore-namespace.sh

BACKUP_NAME="daily-full-backup-20240111020000"
NAMESPACE="production"

echo "Restoring namespace: $NAMESPACE"

velero restore create restore-$NAMESPACE-$(date +%Y%m%d-%H%M%S) \
  --from-backup $BACKUP_NAME \
  --include-namespaces $NAMESPACE \
  --wait

echo "✓ Restore complete"

Disaster Recovery Drills

Recurring drill script:

#!/bin/bash
# disaster-recovery-drill.sh

echo "================================================"
echo "Disaster recovery drill"
echo "================================================"

# 1. Create a test backup
echo ""
echo "1. Creating test backup..."
BACKUP_NAME="dr-drill-$(date +%Y%m%d-%H%M%S)"
velero backup create $BACKUP_NAME \
  --include-namespaces production \
  --wait

# 2. Restore into a test namespace
echo ""
echo "2. Restoring into the test namespace..."
velero restore create restore-drill-$(date +%Y%m%d-%H%M%S) \
  --from-backup $BACKUP_NAME \
  --namespace-mappings production:dr-test \
  --wait

# 3. Verify the application
echo ""
echo "3. Verifying the application..."
kubectl get pods -n dr-test

# 4. Run a functional test
echo ""
echo "4. Running functional test..."
kubectl run test-pod -n dr-test --image=curlimages/curl --rm -it --restart=Never -- \
  curl http://user-service:9001/health

# 5. Clean up
echo ""
echo "5. Cleaning up test resources..."
kubectl delete namespace dr-test
velero backup delete $BACKUP_NAME --confirm

echo ""
echo "================================================"
echo "Drill complete!"
echo "================================================"

Best Practices Summary

1. HPA configuration

✓ Combine multiple metrics
✓ Set sensible min/max replica counts
✓ Configure scaling behavior (stabilization windows)
✓ Use custom metrics where CPU alone is not representative
✓ Do not let HPA and VPA act on the same resources (CPU/memory)

2. VPA configuration

✓ Start in recommendation mode ("Off")
✓ Set sensible minAllowed/maxAllowed bounds
✓ Use Auto mode for stateless workloads
✓ Be cautious with stateful workloads
✓ Monitor the accuracy of VPA recommendations

3. Cluster Autoscaler

✓ Configure multiple node groups
✓ Use node labels and taints
✓ Protect critical Pods with PDBs
✓ Set sensible scale-up/scale-down thresholds
✓ Monitor scale-up latency

4. Backup strategy

✓ Automated recurring backups (daily)
✓ Layered backups (full + incremental)
✓ Cross-region replication of critical backups
✓ Regularly test the restore procedure
✓ Document the recovery steps
✓ Run disaster recovery drills regularly

5. Disaster recovery

✓ Define clear RTO and RPO targets
✓ Deploy across multiple availability zones
✓ Verify backups regularly
✓ Automate the recovery procedure
✓ Retain multiple backup versions
✓ Monitor backup job status

Next: continue with the Cost Optimization and Best Practices chapter.