Gray Releases and Canary Deployments

What Is a Gray Release

A gray release (also called a phased or gradual rollout) is a smooth-transition release strategy: the new version is first enabled for a subset of users, and only after it has been validated is it rolled out to everyone.

Main Strategies

  • Canary release: validate with a small slice of traffic
  • Blue-green deployment: switch between two environments
  • A/B testing: split traffic by user attributes
  • Rolling update: replace instances gradually

How a Canary Release Works

┌─────────────────────────────────┐
│       User traffic: 100%        │
└────────────────┬────────────────┘
                 │
                 ▼
          ┌──────────────┐
          │   Service    │
          └──────┬───────┘
                 │
            ┌────┴────┐
            │         │
        90% ▼         ▼ 10%
   ┌──────────────┐ ┌──────────────┐
   │  Stable      │ │  Canary      │
   │  v1.0        │ │  v2.0        │
   └──────────────┘ └──────────────┘

Core Idea

  1. Deploy a small number of new-version instances (the canary)
  2. Route a small slice of traffic to them for validation
  3. Monitor metrics and roll back quickly if problems appear
  4. Gradually increase traffic as validation passes
  5. Finally switch all traffic to the new version

Approach 1: Canary Release Based on Deployments

1. Deploy the stable version

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
  labels:
    app: myapp
    version: stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      version: stable
  template:
    metadata:
      labels:
        app: myapp
        version: stable
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0
        ports:
        - containerPort: 8080

2. Deploy the canary version

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
  labels:
    app: myapp
    version: canary
spec:
  replicas: 1  # ~10% of traffic (1 of 10 Pods)
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
      - name: myapp
        image: myapp:v2.0  # new version
        ports:
        - containerPort: 8080

3. Service configuration

apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp  # selects both stable and canary Pods
  ports:
  - port: 80
    targetPort: 8080
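
Because the Service load-balances across all ready endpoints, the canary weight is simply the replica ratio: 1 canary / (9 stable + 1 canary) ≈ 10%. A quick sanity check that all ten Pods are behind the Service:

# All ten Pod IPs should appear as endpoints of the Service
kubectl get endpoints myapp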

4. Monitor and validate

# Watch Pod status
kubectl get pods -l app=myapp

# Tail the canary logs
kubectl logs -l version=canary --tail=100 -f

# Check resource usage (error rates need app metrics, e.g. Prometheus)
kubectl top pods -l app=myapp

# Smoke-test through the Service (named myapp above)
for i in {1..100}; do 
  curl http://myapp
  sleep 0.1
done

5. Adjust traffic

# Increase canary traffic to ~50%
kubectl scale deployment myapp-canary --replicas=5
kubectl scale deployment myapp-stable --replicas=5

# Full cutover: promote the new image on the stable Deployment
kubectl set image deployment/myapp-stable myapp=myapp:v2.0
kubectl scale deployment myapp-stable --replicas=10

# Clean up the canary (a Deployment cannot be renamed with
# kubectl patch; metadata.name is immutable)
kubectl delete deployment myapp-canary

Approach 2: Traffic Splitting with Ingress

1. Create two Services

# Stable Service
apiVersion: v1
kind: Service
metadata:
  name: myapp-stable
spec:
  selector:
    app: myapp
    version: stable
  ports:
  - port: 80
    targetPort: 8080

---
# Canary Service
apiVersion: v1
kind: Service
metadata:
  name: myapp-canary
spec:
  selector:
    app: myapp
    version: canary
  ports:
  - port: 80
    targetPort: 8080

2. Weight-based traffic split with Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  # Primary Ingress: no canary annotations here; in ingress-nginx
  # they belong only on the second (canary) Ingress below
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-stable
            port:
              number: 80

---
# Canary Ingress (receives 10% of the traffic)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary-ingress
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
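
With ingress-nginx, the split can be ramped in place by updating the weight annotation on the canary Ingress:

# Ramp the canary from 10% to 30%
kubectl annotate ingress myapp-canary-ingress \
  nginx.ingress.kubernetes.io/canary-weight="30" --overwrite

# Route all traffic to the canary before promoting it
kubectl annotate ingress myapp-canary-ingress \
  nginx.ingress.kubernetes.io/canary-weight="100" --overwrite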

3. Header-based canary

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary-header
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # Requests carrying this header are routed to the canary
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "true"
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80

Test:

# Normal request -> stable version
curl http://myapp.example.com

# With the header set -> canary version
curl -H "X-Canary: true" http://myapp.example.com

4. Cookie-based canary

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary-cookie
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-cookie: "canary-user"
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
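
With ingress-nginx, canary-by-cookie routes a request to the canary only when the cookie's value is the literal string "always"; "never" forces the stable backend, and any other value is ignored in favor of the remaining canary rules. A quick check:

# Cookie set to "always" -> canary backend
curl --cookie "canary-user=always" http://myapp.example.com

# Cookie set to "never" (or absent) -> stable backend
curl --cookie "canary-user=never" http://myapp.example.com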

Approach 3: Automated Canary Releases with Flagger

1. Install Flagger

# Add the Flagger Helm repository
helm repo add flagger https://flagger.app

# Install Flagger (with Istio)
helm upgrade -i flagger flagger/flagger \
  --namespace istio-system \
  --set meshProvider=istio \
  --set metricsServer=http://prometheus:9090

# Or with NGINX Ingress
helm upgrade -i flagger flagger/flagger \
  --namespace ingress-nginx \
  --set meshProvider=nginx
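
To confirm the controller came up (this assumes the Helm release creates a Deployment named flagger, which is the chart default; adjust the namespace to the variant you installed):

kubectl -n istio-system rollout status deployment/flagger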

2. Define a Canary resource

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  # Target Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  
  # Service configuration
  service:
    port: 80
    targetPort: 8080
  
  # Canary analysis settings
  analysis:
    # Interval between checks
    interval: 1m
    # Max number of failed checks before rollback
    threshold: 5
    # Max traffic weight routed to the canary
    maxWeight: 50
    # Weight increment per step
    stepWeight: 10
    
    # Metric checks
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99  # success rate must stay >= 99%
      interval: 1m
    
    - name: request-duration
      thresholdRange:
        max: 500  # request duration must stay <= 500ms
      interval: 1m
    
    # Load-test webhook
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp/"
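
The load-test webhook above assumes Flagger's load tester is reachable at http://flagger-loadtester/; it ships as a chart in the same Helm repository. Install it in the application's namespace so the URL resolves:

helm upgrade -i flagger-loadtester flagger/loadtester \
  --namespace default  # or wherever myapp runs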

3. Trigger a release

# Updating the image triggers a canary release
kubectl set image deployment/myapp \
  myapp=myapp:v2.0

# Watch the rollout progress
watch kubectl get canary myapp

# Inspect events
kubectl describe canary myapp

4. Flagger release flow

1. New version detected
   ↓
2. Canary Pods created (10% traffic)
   ↓
3. Load tests run
   ↓
4. Metrics checked (success rate, latency, ...)
   ↓
5. Metrics healthy -> increase traffic (20%, 30%, ...)
   Metrics unhealthy -> automatic rollback
   ↓
6. maxWeight reached -> canary promoted to primary
   ↓
7. Canary release complete

Blue-Green Deployment

1. How it works

             Service
                │
          ┌─────┴─────┐
          │ Selector  │
          └─────┬─────┘
                │
        ┌───────┴───────┐
        │               │
   version=blue    version=green
        │               │
   ┌────▼────┐    ┌─────▼─────┐
   │ Blue v1 │    │ Green v2  │
   └─────────┘    └───────────┘

2. Deployment configuration

# Blue environment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0

---
# Green environment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:v2.0

---
# Service (pointing at blue)
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # currently routes to blue
  ports:
  - port: 80
    targetPort: 8080

3. Switch traffic

# Switch to the green environment
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'

# Roll back (only possible while the blue Deployment still exists)
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'

# Once green is verified, delete the blue environment
kubectl delete deployment myapp-blue
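
The switch can be wrapped in a small script; a minimal sketch, assuming the Service and Deployment names above:

#!/usr/bin/env bash
# switch.sh -- point the myapp Service at blue or green
set -euo pipefail

TARGET="${1:?usage: switch.sh blue|green}"

# Repoint the Service selector at the target color
kubectl patch service myapp \
  -p "{\"spec\":{\"selector\":{\"app\":\"myapp\",\"version\":\"${TARGET}\"}}}"

# Show which Pod IPs now back the Service
kubectl get endpoints myapp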

Best Practices

1. Metrics to monitor

# Key metrics
- Error rate
- Latency (P50/P90/P99)
- Request volume (QPS/TPS)
- Resource usage (CPU/Memory)
- Business metrics (e.g., order success rate)
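
If the application exports standard HTTP metrics to Prometheus, the first two can be queried roughly as follows (http_requests_total and http_request_duration_seconds are placeholder metric names; adjust to what your app actually exports):

# Error rate over the last 5 minutes
sum(rate(http_requests_total{app="myapp",status=~"5.."}[5m]))
  / sum(rate(http_requests_total{app="myapp"}[5m]))

# P99 latency from a histogram
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le))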

2. Canary release checklist

✓ Before deployment
  - Code review passed
  - Unit tests passed
  - Integration tests passed
  - Rollback plan prepared

✓ Canary phase
  - Start with 1-5% of traffic
  - Monitor key metrics for 15-30 minutes
  - Review error logs
  - Run smoke tests

✓ Ramp-up
  - Increase gradually: 10% -> 25% -> 50% -> 100%
  - Observe each stage for 15-30 minutes
  - Set up automated alerts (see the sketch after this list)

✓ Full release
  - Confirm all metrics are healthy
  - Record the release details
  - Keep the old version available for 1-7 days
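
The automated-alert item can be backed by a Prometheus alert rule; a minimal sketch, assuming the Prometheus Operator CRDs and the placeholder request metric from the previous section:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: canary-error-rate
spec:
  groups:
  - name: canary
    rules:
    - alert: CanaryHighErrorRate
      # Fire when the canary's 5xx ratio exceeds 1% for 5 minutes
      expr: |
        sum(rate(http_requests_total{app="myapp",version="canary",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{app="myapp",version="canary"}[5m])) > 0.01
      for: 5m
      labels:
        severity: critical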

3. Automatic rollback policy

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  analysis:
    # Rollback conditions
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1  # roll back if the error rate exceeds 1%
    
    - name: latency-p99
      thresholdRange:
        max: 1000  # roll back if P99 latency exceeds 1s
    
    # Alerting
    alerts:
    - name: slack
      severity: error
      providerRef:
        name: slack-webhook
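
Note that Flagger ships only two builtin metrics (request-success-rate and request-duration); custom names like error-rate above must be backed by a MetricTemplate. A rough sketch (the Prometheus address and query are illustrative):

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    100 * sum(rate(http_requests_total{app="myapp",status=~"5.."}[1m]))
        / sum(rate(http_requests_total{app="myapp"}[1m]))

The metric entry in the Canary then references it via templateRef with the template's name.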

4. Traffic management policy

# Progressive traffic ramp-up
1%   -> observe for 10 minutes
5%   -> observe for 15 minutes
10%  -> observe for 30 minutes
25%  -> observe for 30 minutes
50%  -> observe for 1 hour
100% -> full release
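
With Flagger, an uneven ramp like this can be approximated with explicit stepWeights (supported in recent Flagger versions) instead of a fixed stepWeight; the analysis interval stays constant, so the longer observation windows above are only approximated:

  analysis:
    interval: 10m
    # Explicit ramp instead of a fixed increment
    stepWeights: [1, 5, 10, 25, 50]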

5. A/B testing scenario

# Split traffic by user ID
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ab-test
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "X-User-ID"
    nginx.ingress.kubernetes.io/canary-by-header-pattern: "[0-4]$"  # user IDs ending in 0-4
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
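
A quick check of the split (the header values are arbitrary examples):

# User ID ends in 0-4 -> canary
curl -H "X-User-ID: 12340" http://myapp.example.com

# User ID ends in 5-9 -> stable
curl -H "X-User-ID: 12345" http://myapp.example.com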

Troubleshooting

1. High error rate on the canary

# Check the canary Pod logs
kubectl logs -l version=canary --tail=100

# Check resource usage
kubectl top pods -l version=canary

# Roll back immediately (drop the canary to zero replicas)
kubectl scale deployment myapp-canary --replicas=0

2. Uneven traffic distribution

# Check the Service endpoints
kubectl get endpoints myapp

# Verify Pod labels
kubectl get pods --show-labels

# Test the traffic split (assumes responses include the version)
for i in {1..100}; do
  curl -s http://myapp | grep version
done | sort | uniq -c

3. Ingress canary not taking effect

# Inspect the Ingress configuration
kubectl describe ingress myapp-canary-ingress

# Check the Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller

# Verify the canary annotations
kubectl get ingress myapp-canary-ingress -o yaml | grep canary

Summary

Comparison of gray release strategies:

Strategy        Pros                          Cons                      Best for
Canary          Low risk, gradual validation  Needs monitoring support  Most scenarios
Blue-green      Fast switch, easy rollback    High resource cost        Critical systems
Rolling update  Efficient resource usage      Slow rollback             Stateless apps
A/B testing     Precise user targeting        Complex to implement      Feature validation

Recommendations:

  • Small applications: a simple Deployment + Service canary
  • Medium applications: traffic control with Ingress
  • Large applications: automated releases with Flagger or Istio