Gray Release and Canary Deployment
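What Is a Gray Release
A gray release is a gradual rollout strategy: the new version is first exposed to a subset of users and is rolled out to everyone only after it has been verified.
Main strategies:
- Canary release: validate with a small share of traffic
- Blue-green deployment: switch between two environments
- A/B testing: split traffic by user attributes
- Rolling update: replace instances step by step
How a Canary Release Works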
┌─────────────────────────────────┐
│        User traffic 100%        │
└────────────────┬────────────────┘
                 │
                 ▼
          ┌─────────────┐
          │   Service   │
          └──────┬──────┘
                 │
          ┌──────┴──────┐
          │             │
      90% │             │ 10%
          ▼             ▼
     ┌─────────┐   ┌─────────┐
     │ Stable  │   │ Canary  │
     │  v1.0   │   │  v2.0   │
     └─────────┘   └─────────┘
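Core idea:
- Deploy a small number of new-version instances (the canary)
- Route a small share of traffic to them for validation
- Monitor metrics and roll back quickly if problems appear
- Gradually increase traffic once validation passes
- Finally switch all traffic to the new version
Method 1: Canary Release with Deployments
1. Deploy the stable version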
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
  labels:
    app: myapp
    version: stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      version: stable
  template:
    metadata:
      labels:
        app: myapp
        version: stable
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0
        ports:
        - containerPort: 8080
2. Deploy the canary version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
  labels:
    app: myapp
    version: canary
spec:
  replicas: 1  # 1 of 10 Pods, roughly 10% of traffic
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
      - name: myapp
        image: myapp:v2.0  # new version
        ports:
        - containerPort: 8080
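3. Configure the Service
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp  # matches both stable and canary Pods
  ports:
  - port: 80
    targetPort: 8080
Because the Service balances requests across every matching Pod, the canary's traffic share is roughly its share of the replicas (1 of 10 here).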
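4. Monitor and verify
# Check Pod status
kubectl get pods -l app=myapp
# Follow the canary logs
kubectl logs -l version=canary --tail=100 -f
# Check CPU and memory usage (requires metrics-server; error rates come from logs or your monitoring system)
kubectl top pods -l app=myapp
# Send test requests (run inside the cluster, where the Service name myapp resolves)
for i in {1..100}; do
  curl http://myapp
  sleep 0.1
done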
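5. Adjust traffic
# Raise canary traffic to about 50%
kubectl scale deployment myapp-canary --replicas=5
kubectl scale deployment myapp-stable --replicas=5
# Switch all traffic to the canary
kubectl scale deployment myapp-canary --replicas=10
kubectl scale deployment myapp-stable --replicas=0
# Promote the new version: a Deployment cannot be renamed, so update the stable Deployment's image instead
kubectl set image deployment/myapp-stable myapp=myapp:v2.0
kubectl scale deployment myapp-stable --replicas=10
kubectl rollout status deployment/myapp-stable
# Retire the canary until the next release
kubectl scale deployment myapp-canary --replicas=0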
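With this method the traffic share is set only by replica counts, so each traffic step comes down to a small calculation. The sketch below shows one way to derive the two `--replicas` values from a total Pod count and a target canary percentage. The function name `canary_split` is hypothetical, and rounding up is an assumption, chosen so that any non-zero percentage yields at least one canary Pod.

```shell
# canary_split TOTAL PERCENT
# Prints "<canary replicas> <stable replicas>" for the requested canary share.
canary_split() {
  local total="$1" percent="$2"
  # Integer ceiling of total * percent / 100.
  local canary=$(( (total * percent + 99) / 100 ))
  local stable=$(( total - canary ))
  echo "$canary $stable"
}

# Usage (not executed here): 10 Pods in total, 25% canary.
#   set -- $(canary_split 10 25)
#   kubectl scale deployment myapp-canary --replicas="$1"
#   kubectl scale deployment myapp-stable --replicas="$2"
```

With 10 Pods, a 25% target rounds up to 3 canary Pods, so the real share is closer to 30%. Finer control needs more replicas or one of the Ingress-based methods.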
Method 2: Traffic Splitting with Ingress
1. Create two Services
# Stable Service
apiVersion: v1
kind: Service
metadata:
  name: myapp-stable
spec:
  selector:
    app: myapp
    version: stable
  ports:
  - port: 80
    targetPort: 8080
---
# Canary Service
apiVersion: v1
kind: Service
metadata:
  name: myapp-canary
spec:
  selector:
    app: myapp
    version: canary
  ports:
  - port: 80
    targetPort: 8080
2. Split traffic with Ingress
# Primary Ingress (must not carry canary annotations)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-stable
            port:
              number: 80
---
# Canary Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary-ingress
  annotations:
    # NGINX Ingress canary annotations
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% of traffic
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
3. Header-based canary
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary-header
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # Requests carrying this header with this value go to the canary
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "true"
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
Test:
# Plain request -> stable version
curl http://myapp.example.com
# Request with the header -> canary version
curl -H "X-Canary: true" http://myapp.example.com
4. Cookie-based canary
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary-cookie
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # Requests whose canary-user cookie is set to "always" go to the canary
    nginx.ingress.kubernetes.io/canary-by-cookie: "canary-user"
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
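When header, cookie, and weight rules are combined, ingress-nginx evaluates them in a fixed order: canary-by-header first, then canary-by-cookie, then canary-weight. The sketch below models that order for a single request so the interaction is easier to reason about. The function `route_request` and its arguments are hypothetical. It covers only the default `always`/`never` values (not a custom `canary-by-header-value`), and the `ROLL` argument stands in for the controller's random draw.

```shell
# route_request HEADER COOKIE WEIGHT ROLL
#   HEADER  value of the canary header, "" if absent
#   COOKIE  value of the canary cookie, "" if absent
#   WEIGHT  canary-weight, 0-100
#   ROLL    number 0-99 standing in for the random draw
# Prints "canary" or "stable".
route_request() {
  local header="$1" cookie="$2" weight="$3" roll="$4"
  # 1. Header rule: "always" and "never" are decisive, anything else falls through.
  case "$header" in
    always) echo canary; return ;;
    never)  echo stable; return ;;
  esac
  # 2. Cookie rule: same semantics as the header rule.
  case "$cookie" in
    always) echo canary; return ;;
    never)  echo stable; return ;;
  esac
  # 3. Weight rule: WEIGHT percent of the remaining requests go to the canary.
  if [ "$roll" -lt "$weight" ]; then
    echo canary
  else
    echo stable
  fi
}
```

For example, `route_request never always 100 0` prints `stable`: the header's `never` wins even though the cookie and the weight would both select the canary.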
Method 3: Automated Canary Releases with Flagger
1. Install Flagger
# Add the Flagger Helm repository
helm repo add flagger https://flagger.app
# Install Flagger (with Istio)
helm upgrade -i flagger flagger/flagger \
  --namespace istio-system \
  --set meshProvider=istio \
  --set metricsServer=http://prometheus:9090
# Or with NGINX Ingress
helm upgrade -i flagger flagger/flagger \
  --namespace ingress-nginx \
  --set meshProvider=nginx
2. Define the Canary resource
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  # Target Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # Service settings
  service:
    port: 80
    targetPort: 8080
  # Canary analysis settings
  analysis:
    # Interval between checks
    interval: 1m
    # Number of failed checks tolerated before rollback
    threshold: 5
    # Maximum canary traffic weight (%)
    maxWeight: 50
    # Weight increment per step (%)
    stepWeight: 10
    # Metric checks
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99  # success rate of at least 99%
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500  # request duration of at most 500 ms
      interval: 1m
    # Webhook tests
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://myapp/"
3. Trigger a release
# Updating the image triggers a canary release
kubectl set image deployment/myapp \
  myapp=myapp:v2.0
# Watch the rollout
watch kubectl get canary myapp
# View events
kubectl describe canary myapp
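4. Flagger release flow
1. Detect a new version
   ↓
2. Create canary Pods (10% of traffic)
   ↓
3. Run the load test
   ↓
4. Check metrics (success rate, latency, and so on)
   ↓
5. Metrics healthy -> increase traffic (20%, 30%, ... up to maxWeight)
   Metrics unhealthy -> automatic rollback once the failure threshold is reached
   ↓
6. maxWeight reached (50% in the example above) -> promote the new version to primary
   ↓
7. Canary release complete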
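The flow above can be sketched as a small simulation that shows how `stepWeight`, `maxWeight`, and `threshold` interact. This is a simplified model, not Flagger's actual control loop: the function `simulate_canary` is hypothetical, each argument after the first three stands for the result of one analysis interval, and failed checks are assumed to accumulate over the whole analysis.

```shell
# simulate_canary STEP_WEIGHT MAX_WEIGHT THRESHOLD RESULT...
# Each RESULT is "ok" or "fail" for one analysis interval.
# Prints the final state of the rollout.
simulate_canary() {
  local step="$1" max="$2" threshold="$3"
  shift 3
  local weight=0 failed=0 result
  for result in "$@"; do
    if [ "$result" = ok ]; then
      # A healthy check advances the canary by one step.
      weight=$(( weight + step ))
      if [ "$weight" -ge "$max" ]; then
        echo "promoted at ${max}%"
        return 0
      fi
    else
      # An unhealthy check counts toward the rollback threshold.
      failed=$(( failed + 1 ))
      if [ "$failed" -ge "$threshold" ]; then
        echo "rolled back at ${weight}%"
        return 0
      fi
    fi
  done
  echo "in progress at ${weight}%"
}
```

With the settings from the Canary resource, `simulate_canary 10 50 5 ok ok ok ok ok` prints `promoted at 50%`, so a clean rollout takes at least five one-minute intervals.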
Blue-Green Deployment
1. How it works
              Service
                 │
           ┌─────┴─────┐
           │ Selector  │
           └─────┬─────┘
                 │
         ┌───────┴───────┐
         │               │
   version=blue    version=green
         │               │
    ┌────▼────┐     ┌────▼────┐
    │ Blue v1 │     │ Green v2│
    └─────────┘     └─────────┘
2. Deployment configuration
# Blue environment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myapp:v1.0
---
# Green environment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myapp:v2.0
---
# Service (pointing at blue)
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # currently points at blue
  ports:
  - port: 80
    targetPort: 8080
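3. Switch traffic
# Switch to the green environment
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'
# Roll back if problems appear (only possible while the blue Deployment still exists)
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'
# Once green is verified, delete the blue environment
kubectl delete deployment myapp-blue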
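Since switching and rolling back are the same patch with a different color, the two commands can be generated by one helper. The sketch below is one possible shape. The function names `other_color` and `selector_patch` are hypothetical, and it assumes the Service selector only ever uses the values `blue` and `green`.

```shell
# other_color COLOR: prints the idle color for the given live color.
other_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown color: $1" >&2; return 1 ;;
  esac
}

# selector_patch COLOR: prints the JSON patch that points the Service at COLOR.
selector_patch() {
  printf '{"spec":{"selector":{"version":"%s"}}}\n' "$1"
}

# Usage (not executed here): flip the Service to whichever color is idle.
#   live=$(kubectl get service myapp -o jsonpath='{.spec.selector.version}')
#   kubectl patch service myapp -p "$(selector_patch "$(other_color "$live")")"
```

Running the usage lines a second time flips the Service back, which is the rollback path.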
Best Practices
1. Metrics to monitor
# Key metrics
- Error rate
- Response time (latency P50/P90/P99)
- Request volume (QPS/TPS)
- Resource usage (CPU/memory)
- Business metrics (order success rate, and so on)
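2. Canary release checklist
✓ Before deployment
- Code review passed
- Unit tests passed
- Integration tests passed
- Rollback plan prepared
✓ Canary stage
- Start with 1-5% of traffic
- Monitor key metrics for 15-30 minutes
- Review error logs
- Run smoke tests
✓ Ramp-up
- Increase gradually: 10% -> 25% -> 50% -> 100%
- Observe each stage for 15-30 minutes
- Set up automated alerts
✓ Full release
- Confirm that all metrics are healthy
- Record the release details
- Keep the old version available for 1-7 days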
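3. Automatic rollback policy
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  analysis:
    # Automatic rollback conditions
    # error-rate and latency-p99 are custom metrics, not Flagger built-ins;
    # each needs a templateRef to a MetricTemplate (omitted here)
    metrics:
    - name: error-rate
      thresholdRange:
        max: 1  # roll back when the error rate exceeds 1%
    - name: latency-p99
      thresholdRange:
        max: 1000  # roll back when P99 latency exceeds 1 s
    # Alerts
    alerts:
    - name: slack
      severity: error
      providerRef:
        name: slack-webhook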
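The same two rollback conditions can also be checked by hand during a manual canary. The sketch below is a minimal gate under stated assumptions: the function `should_rollback` is hypothetical, the error and request counts and the P99 value are assumed to come from your own monitoring system, and the 1% and 1000 ms limits mirror the thresholds above.

```shell
# should_rollback ERRORS TOTAL P99_MS
# Succeeds (exit status 0) when the canary should be rolled back:
# error rate above 1% or P99 latency above 1000 ms.
should_rollback() {
  local errors="$1" total="$2" p99_ms="$3"
  local max_error_percent=1 max_p99_ms=1000
  # errors / total > 1%  is the same as  errors * 100 > total * 1,
  # which avoids floating-point arithmetic.
  if [ $(( errors * 100 )) -gt $(( total * max_error_percent )) ]; then
    return 0
  fi
  [ "$p99_ms" -gt "$max_p99_ms" ]
}

# Usage (not executed here):
#   if should_rollback 11 1000 200; then
#     kubectl scale deployment myapp-canary --replicas=0
#   fi
```

Both limits are strict: exactly 1% errors or exactly 1000 ms does not trigger a rollback.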
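4. Traffic management strategy
# Progressive traffic shifting
1% -> observe for 10 minutes
5% -> observe for 15 minutes
10% -> observe for 30 minutes
25% -> observe for 30 minutes
50% -> observe for 1 hour
100% -> full release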
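To see what a schedule like this costs in wall-clock time, the stages can be added up. The sketch below does that. The function name `rollout_plan` and the `<weight>:<minutes>` stage format are assumptions made for illustration.

```shell
# rollout_plan STAGE...
# Each STAGE is "<weight>:<minutes>". Prints the elapsed time at the end of
# every stage, then the earliest point at which a full release can happen.
rollout_plan() {
  local elapsed=0 stage weight minutes
  for stage in "$@"; do
    weight=${stage%%:*}    # text before the colon
    minutes=${stage##*:}   # text after the colon
    elapsed=$(( elapsed + minutes ))
    echo "${weight}% until t+${elapsed}m"
  done
  echo "100% after ${elapsed}m"
}

rollout_plan 1:10 5:15 10:30 25:30 50:60
```

For the schedule above, the last line printed is `100% after 145m`: a release with no failed stage needs at least 2 hours 25 minutes of observation.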
5. A/B testing scenario
# Split traffic by user ID
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ab-test
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "X-User-ID"
    nginx.ingress.kubernetes.io/canary-by-header-pattern: "[0-4]$"  # user IDs ending in 0-4
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80
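Before applying a header pattern it helps to check which user IDs it would select. The sketch below tests the pattern `[0-4]$` locally. The function name `is_canary_user` is hypothetical, and `grep -E` uses POSIX extended regular expressions rather than the controller's PCRE engine, although for a pattern this simple the two should agree.

```shell
# is_canary_user USER_ID
# Succeeds when USER_ID matches [0-4]$, i.e. when its last character is 0-4.
is_canary_user() {
  printf '%s\n' "$1" | grep -Eq '[0-4]$'
}

for id in 1000 1004 1005 1009; do
  if is_canary_user "$id"; then
    echo "$id -> canary"
  else
    echo "$id -> stable"
  fi
done
```

IDs 1000 and 1004 go to the canary, 1005 and 1009 stay on the stable version. If the last digits of user IDs are evenly distributed, the pattern selects about half of all users.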
Troubleshooting
1. High error rate on the canary
# View canary Pod logs
kubectl logs -l version=canary --tail=100
# Check resource usage
kubectl top pods -l version=canary
# Roll back immediately
kubectl scale deployment myapp-canary --replicas=0
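2. Uneven traffic distribution
# Check the Service endpoints
kubectl get endpoints myapp
# Verify Pod labels
kubectl get pods --show-labels
# Measure the traffic split (assumes the response body contains a version string)
for i in {1..100}; do
  curl -s http://myapp | grep version
done | sort | uniq -c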
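The counts printed by `uniq -c` still have to be turned into a percentage by hand. The sketch below does the conversion directly. The function name `canary_share` is hypothetical, and it assumes one response line per request, with canary responses recognizable by a pattern such as the version string.

```shell
# canary_share PATTERN
# Reads one response line per request on stdin and prints the integer
# percentage (rounded down) of lines that match PATTERN.
canary_share() {
  awk -v pat="$1" '
    $0 ~ pat { hits++ }
    END {
      if (NR == 0) print 0
      else printf "%d\n", hits * 100 / NR
    }
  '
}

# Usage (not executed here), assuming canary responses contain "v2":
#   for i in $(seq 1 100); do curl -s http://myapp | grep version; done | canary_share v2
```

With a small sample the result can differ noticeably from the configured weight, so a deviation of a few points over 100 requests is not a sign of misconfiguration on its own.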
3. Ingress canary has no effect
# Inspect the canary Ingress
kubectl describe ingress myapp-canary-ingress
# View the Ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/component=controller
# Verify the annotations
kubectl get ingress myapp-canary-ingress -o yaml | grep canary
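Summary
Comparison of gray release strategies:
| Strategy | Pros | Cons | Best suited for |
|---|---|---|---|
| Canary | Low risk, incremental validation | Requires monitoring | Most scenarios |
| Blue-green | Fast switch, easy rollback | High resource cost | Critical systems |
| Rolling update | Efficient resource use | Slow rollback | Stateless applications |
| A/B testing | Precise control over user groups | Complex to implement | Feature validation |
Recommendations:
- Small applications: a simple canary with Deployment + Service
- Medium applications: Ingress-based traffic control
- Large applications: automated releases with Flagger or Istio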