Deployment-Based Canary Releases
How it works
Two independent Deployments are used, and traffic is split by having a single Service's label selector match both of them.
┌─────────────────────────────────┐
│       Service (app=myapp)       │
└────────────────┬────────────────┘
                 │ load balancing
        ┌────────┴─────────┐
        │                  │
   90%  │                  │  10%
        ▼                  ▼
┌──────────────┐    ┌──────────────┐
│  Deployment  │    │  Deployment  │
│ myapp-stable │    │ myapp-canary │
│ replicas: 9  │    │ replicas: 1  │
│ version=v1.0 │    │ version=v2.0 │
└──────────────┘    └──────────────┘
Core mechanics:
- The Service matches both Deployments through the app=myapp selector
- Traffic is split in proportion to Pod counts (9:1 = 90%:10%)
- The split is adjusted by changing the replica counts
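Based on this mechanism, the commands below (assuming the Service and the two Deployments have been created as shown later in this section) give a quick sanity check that both Pod groups sit behind the same Service; the ratio of running Pods approximates the traffic split:
# Inspect the Service's label selector
kubectl get service myapp -o jsonpath='{.spec.selector}'
# Count running Pods per version; their ratio approximates the traffic split
kubectl get pods -l app=myapp,version=stable --field-selector=status.phase=Running --no-headers | wc -l
kubectl get pods -l app=myapp,version=canary --field-selector=status.phase=Running --no-headers | wc -l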
Complete hands-on example
1. Deploy the stable version
# myapp-stable.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
  labels:
    app: myapp
    version: stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      version: stable
  template:
    metadata:
      labels:
        app: myapp
        version: stable
    spec:
      containers:
      - name: myapp
        image: nginx:1.20   # v1.0
        ports:
        - containerPort: 80
        env:
        - name: VERSION
          value: "v1.0"
        # Health checks
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 3
        # Resource limits
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
Apply the configuration:
kubectl apply -f myapp-stable.yaml
# Verify
kubectl get pods -l version=stable
kubectl get deployment myapp-stable
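Optionally, wait for the rollout to complete before continuing (the 5-minute timeout is an arbitrary value):
kubectl rollout status deployment/myapp-stable --timeout=5m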
2. Create the Service
# myapp-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp        # selects every Pod with app=myapp
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
  type: ClusterIP
Apply the configuration:
kubectl apply -f myapp-service.yaml
# Test access from a temporary Pod
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# then, inside the temporary Pod's shell:
wget -qO- http://myapp
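You can also confirm exactly which Pod IPs the Service routes to by listing its Endpoints; at this point they should all belong to the stable version:
kubectl get endpoints myapp
# On clusters with EndpointSlices, the same information is available via:
kubectl get endpointslices -l kubernetes.io/service-name=myapp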
3. Deploy the canary version
# myapp-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
  labels:
    app: myapp
    version: canary
spec:
  replicas: 1    # 10% of traffic (1 / (9 + 1) = 10%)
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
      annotations:
        # Prometheus monitoring
        prometheus.io/scrape: "true"
        prometheus.io/port: "80"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: myapp
        image: nginx:1.21   # v2.0, the new version
        ports:
        - containerPort: 80
        env:
        - name: VERSION
          value: "v2.0"
        - name: ENVIRONMENT
          value: "canary"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 3
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
Deploy the canary:
kubectl apply -f myapp-canary.yaml
# Verify the deployment
kubectl get pods -l app=myapp --show-labels
kubectl get pods -l version=canary
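The Service's Endpoints should now include all ten Pods, nine stable and one canary. The -L flag prints the version label as an extra column, which makes the 9:1 split easy to see:
kubectl get endpoints myapp
kubectl get pods -l app=myapp -L version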
4. Verify the traffic split
Create a test script:
#!/bin/bash
# test-traffic.sh
echo "Testing traffic distribution..."
for i in {1..100}; do
  # use -i only (no TTY), so the command works inside command substitution
  VERSION=$(kubectl run test-$RANDOM -i --rm --restart=Never --image=busybox -- \
    wget -qO- http://myapp | grep -o 'v[0-9.]*' | head -1)
  echo $VERSION
  sleep 0.1
done | sort | uniq -c
# Expected output looks roughly like:
#   90 v1.0   (stable version)
#   10 v2.0   (canary version)
Run the test:
chmod +x test-traffic.sh
./test-traffic.sh
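Spawning a new Pod for every request is slow. A lighter-weight variant (still assuming the application's response body contains its version string) runs the whole loop inside a single temporary Pod and only pipes the aggregated output back:
kubectl run traffic-test -i --rm --restart=Never --image=busybox -- sh -c \
  'for i in $(seq 1 100); do wget -qO- http://myapp | grep -o "v[0-9.]*" | head -1; done' \
  | sort | uniq -c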
5. Monitor the canary
Check Pod status
# Watch the Pods in real time
watch -n 2 kubectl get pods -l app=myapp
# Tail the canary Pods' logs
kubectl logs -f -l version=canary
# Inspect the canary Pods' events
kubectl describe pod -l version=canary
Monitor resource usage
# CPU and memory usage
kubectl top pods -l app=myapp
# Broken down by version
kubectl top pods -l version=canary
kubectl top pods -l version=stable
Monitor key metrics
# Error rate (assuming the application exposes a /metrics endpoint)
kubectl exec -it $(kubectl get pod -l version=canary -o jsonpath='{.items[0].metadata.name}') -- \
  curl localhost/metrics | grep error_total
# Response time
kubectl exec -it $(kubectl get pod -l version=canary -o jsonpath='{.items[0].metadata.name}') -- \
  curl localhost/metrics | grep request_duration_seconds
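To compare both versions side by side, a small loop can aggregate the same metric across all Pods of each version (the /metrics endpoint, the error_total metric name, and the presence of curl in the image are assumptions carried over from the commands above):
for v in stable canary; do
  echo "== $v =="
  for p in $(kubectl get pods -l app=myapp,version=$v -o jsonpath='{.items[*].metadata.name}'); do
    kubectl exec $p -- curl -s localhost/metrics | grep '^error_total' || true
  done
done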
6. Traffic adjustment strategy
Stage 1: initial 10% of traffic
# Current configuration
kubectl get deployment myapp-stable -o yaml | grep replicas
# replicas: 9
kubectl get deployment myapp-canary -o yaml | grep replicas
# replicas: 1
# Traffic split: 90% vs 10%
Observe for 15-30 minutes and check that:
- Pods are healthy
- The error rate has not increased
- Response times are within the expected range
- There are no unusual log entries
Stage 2: increase to 25% of traffic
# Adjust the replica counts
kubectl scale deployment myapp-canary --replicas=3
kubectl scale deployment myapp-stable --replicas=9
# The traffic split is now 75% vs 25%
Observe for 30 minutes.
Stage 3: increase to 50% of traffic
kubectl scale deployment myapp-canary --replicas=5
kubectl scale deployment myapp-stable --replicas=5
# Traffic split: 50% vs 50%
Observe for 1 hour.
Stage 4: full cutover
# Switch all traffic to the canary version
kubectl scale deployment myapp-canary --replicas=10
kubectl scale deployment myapp-stable --replicas=0
# Wait for all old Pods to terminate
kubectl wait --for=delete pod -l version=stable --timeout=300s
# Verify
kubectl get pods -l app=myapp
Stage 5: clean up the old version
# After the new version has run stably for 1-3 days
kubectl delete deployment myapp-stable
# Optional: promote the canary to the new "stable". A Deployment's name and selector
# cannot be changed in place (the selector is immutable), so this cannot be done with
# kubectl patch. Instead, copy myapp-canary.yaml, change name, selector and labels
# from canary to stable, apply the edited manifest, then remove the old canary:
kubectl apply -f myapp-stable-v2.yaml   # the renamed/edited manifest (example file name)
kubectl delete deployment myapp-canary
7. Fast rollback
If a problem is detected, roll back immediately:
# Option 1: delete the canary
kubectl delete deployment myapp-canary
# Option 2: scale the canary down to 0 replicas
kubectl scale deployment myapp-canary --replicas=0
# Option 3: restore the stable version's replica count
kubectl scale deployment myapp-stable --replicas=10
# Verify
kubectl get pods -l app=myapp
Automation scripts
Canary deployment script
#!/bin/bash
# canary-deploy.sh
set -e
APP_NAME="myapp"
STABLE_DEPLOYMENT="${APP_NAME}-stable"
CANARY_DEPLOYMENT="${APP_NAME}-canary"
NEW_IMAGE=$1
CANARY_REPLICAS=$2
if [ -z "$NEW_IMAGE" ] || [ -z "$CANARY_REPLICAS" ]; then
  echo "Usage: $0 <new-image> <canary-replicas>"
  echo "Example: $0 nginx:1.21 1"
  exit 1
fi
echo "🚀 Starting canary deployment..."
echo "New image: $NEW_IMAGE"
echo "Canary replicas: $CANARY_REPLICAS"
# 1. Update the canary Deployment's image
echo "📦 Updating canary deployment..."
kubectl set image deployment/$CANARY_DEPLOYMENT \
  ${APP_NAME}=$NEW_IMAGE
# 2. Wait for the canary rollout to finish
echo "⏳ Waiting for canary rollout..."
kubectl rollout status deployment/$CANARY_DEPLOYMENT --timeout=5m
# 3. Adjust the replica count
echo "📊 Scaling canary to $CANARY_REPLICAS replicas..."
kubectl scale deployment/$CANARY_DEPLOYMENT --replicas=$CANARY_REPLICAS
# 4. Wait for the Pods to become ready
echo "⏳ Waiting for pods to be ready..."
kubectl wait --for=condition=ready pod \
  -l app=$APP_NAME,version=canary \
  --timeout=2m
# 5. Show the result
echo "✅ Canary deployment successful!"
echo ""
echo "Current status:"
kubectl get pods -l app=$APP_NAME --show-labels
echo ""
echo "Run './monitor-canary.sh' to monitor the deployment"
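A typical run chains the deployment and monitoring scripts together (the image tag and replica count are placeholders):
chmod +x canary-deploy.sh monitor-canary.sh
./canary-deploy.sh nginx:1.21 1
./monitor-canary.sh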
Monitoring script
#!/bin/bash
# monitor-canary.sh
APP_NAME="myapp"
INTERVAL=5
while true; do
  clear
  echo "=== Canary Deployment Monitor ==="
  echo "Time: $(date)"
  echo ""
  echo "📊 Pod Status:"
  kubectl get pods -l app=$APP_NAME -o wide
  echo ""
  echo "📈 Resource Usage:"
  kubectl top pods -l app=$APP_NAME 2>/dev/null || echo "Metrics not available"
  echo ""
  echo "🔍 Recent Events:"
  kubectl get events --field-selector involvedObject.kind=Pod \
    --sort-by='.lastTimestamp' | tail -5
  echo ""
  echo "⏱️ Next refresh in ${INTERVAL}s (Ctrl+C to exit)"
  sleep $INTERVAL
done
Traffic switch script
#!/bin/bash
# switch-traffic.sh
set -e
APP_NAME="myapp"
STABLE_DEPLOYMENT="${APP_NAME}-stable"
CANARY_DEPLOYMENT="${APP_NAME}-canary"
CANARY_PERCENTAGE=$1
if [ -z "$CANARY_PERCENTAGE" ]; then
echo "Usage: $0 <percentage>"
echo "Example: $0 25 (25% to canary)"
exit 1
fi
TOTAL_REPLICAS=10
# Note: with 10 total replicas, integer division rounds the split to multiples of 10%
CANARY_REPLICAS=$((TOTAL_REPLICAS * CANARY_PERCENTAGE / 100))
STABLE_REPLICAS=$((TOTAL_REPLICAS - CANARY_REPLICAS))
STABLE_PERCENTAGE=$((100 - CANARY_PERCENTAGE))
echo "🔀 Switching traffic..."
echo "Stable: ${STABLE_REPLICAS} replicas (${STABLE_PERCENTAGE}%)"
echo "Canary: ${CANARY_REPLICAS} replicas (${CANARY_PERCENTAGE}%)"
# Adjust the replica counts
kubectl scale deployment/$STABLE_DEPLOYMENT --replicas=$STABLE_REPLICAS
kubectl scale deployment/$CANARY_DEPLOYMENT --replicas=$CANARY_REPLICAS
# Wait for the scaling to settle
kubectl wait --for=condition=ready pod \
  -l app=$APP_NAME \
  --timeout=2m
echo "✅ Traffic switch completed!"
kubectl get pods -l app=$APP_NAME
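For example, to move to the 25% stage from the walkthrough above (with 10 total replicas the integer division rounds this down to 2 canary Pods, i.e. 20%):
./switch-traffic.sh 25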
Rollback script
#!/bin/bash
# rollback-canary.sh
set -e
APP_NAME="myapp"
STABLE_DEPLOYMENT="${APP_NAME}-stable"
CANARY_DEPLOYMENT="${APP_NAME}-canary"
echo "⚠️ Rolling back canary deployment..."
# 1. Remove the canary Pods
echo "🗑️ Removing canary pods..."
kubectl scale deployment/$CANARY_DEPLOYMENT --replicas=0
# 2. Restore the stable version
echo "♻️ Restoring stable version..."
kubectl scale deployment/$STABLE_DEPLOYMENT --replicas=10
# 3. Wait for the stable Pods to become ready
kubectl wait --for=condition=ready pod \
  -l app=$APP_NAME,version=stable \
  --timeout=2m
echo "✅ Rollback completed!"
kubectl get pods -l app=$APP_NAME
Best practices
1. Replica planning
# Small applications (around 5 Pods)
Stable version: 4
Canary: 1
Traffic split: 80% vs 20%
# Medium scale (10-20 Pods)
Stable version: 9
Canary: 1
Traffic split: 90% vs 10%
# Large scale (> 50 Pods)
Stable version: 99
Canary: 1
Traffic split: 99% vs 1%
2. Health check configuration
# Strict health checks
livenessProbe:
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 30   # give the application enough time to start
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 2
  successThreshold: 2       # require 2 consecutive successes before marking Ready
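For applications that are slow to start, it can also help to add a startupProbe (a standard Kubernetes field; the values below are illustrative) so the livenessProbe does not kill the container while it is still booting:
startupProbe:
  httpGet:
    path: /healthz
    port: 80
  failureThreshold: 30   # allow up to 30 × 10s = 5 minutes for startup
  periodSeconds: 10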
3. Resource limits
resources:
  requests:
    cpu: 100m        # guaranteed minimum resources
    memory: 128Mi
  limits:
    cpu: 500m        # prevent resource exhaustion
    memory: 512Mi
4. Label conventions
labels:
  app: myapp                      # application name
  version: canary                 # version identifier
  release: v2.0.1                 # specific release number
  environment: production         # environment
  managed-by: deployment-script   # how the object is managed
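With consistent labels in place, day-to-day queries stay uniform, for example:
# All Pods of the application
kubectl get pods -l app=myapp
# Only the canary Pods of a specific release
kubectl get pods -l app=myapp,version=canary,release=v2.0.1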
Troubleshooting
Problem 1: traffic split does not match expectations
Symptom:
# The traffic test shows a skewed distribution
95 v1.0
 5 v2.0   # expected 10%, but noticeably less
Diagnosis:
# 1. Check the Pod counts
kubectl get pods -l app=myapp --show-labels
# 2. Check Pod readiness
kubectl get pods -l version=canary -o wide
# 3. Check the Service Endpoints
kubectl get endpoints myapp -o yaml
Resolution:
- Make sure the canary Pods are in the Ready state
- Check that the readinessProbe is configured correctly
- Verify that the labels are correct
Problem 2: canary Pods fail to start
Diagnosis:
# Inspect the Pod status
kubectl describe pod -l version=canary
# Check the logs
kubectl logs -l version=canary --tail=100
# Check the events
kubectl get events --field-selector involvedObject.kind=Pod
Common causes:
- Image pull failures
- Insufficient resources
- Configuration errors
- Failing health checks
Problem 3: rollback does not take effect
Symptom: the service is still unhealthy after the canary has been deleted
Cause: the stable version's replica count was also scaled down to 0
Resolution:
# Immediately restore the stable version
kubectl scale deployment myapp-stable --replicas=10
# Force a restart if needed
kubectl rollout restart deployment myapp-stable
Summary
Deployment-based canary releases are the simplest way to implement the pattern:
Pros:
✅ Simple to implement, no extra components required
✅ Built entirely on native Kubernetes features
✅ Well suited to small applications
Cons:
❌ Coarse-grained traffic control (limited by Pod counts)
❌ Manual operation required
❌ No automated rollback
Typical scenarios:
- Quick validation for small applications
- Teams trying canary releases for the first time
- Resource-constrained environments
The next section shows how to achieve finer-grained traffic control with an Ingress.