# Flagger Automated Canary Releases
## Introduction to Flagger

Flagger is a progressive delivery tool open-sourced by Weaveworks that fully automates canary releases.

### Core Features

- ✅ Automated workflow: shifts traffic, monitors metrics, and decides on rollback automatically
- ✅ Multiple providers: Istio, Linkerd, App Mesh, NGINX, Contour
- ✅ Metrics-driven: promotion decisions based on Prometheus metrics
- ✅ Notification integrations: Slack, Teams, Discord alerts
- ✅ Load testing: built-in load-testing capability
### How It Works

```text
1. New version detected
   ↓
2. Initialize the canary (create canary pods)
   ↓
3. Run pre-rollout tests (smoke tests)
   ↓
4. Start shifting traffic
   ├─ 10% → check metrics → pass
   ├─ 20% → check metrics → pass
   ├─ 30% → check metrics → pass
   ├─ 50% → check metrics → fail → automatic rollback ❌
   └─ 100% → promotion complete ✅
   ↓
5. Clean up canary resources
```
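The stepping flow above can be sketched as a small simulation of the analysis loop. This is illustrative pseudologic only, not Flagger's source; `run_canary` and its arguments are invented for the sketch. Each interval the weight grows by `stepWeight` until `maxWeight`; failed metric checks accumulate, and hitting the failure `threshold` triggers rollback.

```python
def run_canary(metric_ok, step_weight=10, max_weight=50, threshold=5):
    """Simulate one canary analysis loop (a sketch, not Flagger's code).

    metric_ok: callable taking the current weight, returning True if the
               metric checks pass for that interval.
    Returns ("promoted" | "rolled back", final_weight).
    """
    weight, failed = 0, 0
    while True:
        if not metric_ok(weight):
            failed += 1                      # one more failed check
            if failed >= threshold:
                return "rolled back", weight # too many failures: abort
            continue                         # hold weight, re-check next interval
        if weight >= max_weight:
            return "promoted", weight        # reached maxWeight healthy
        weight = min(weight + step_weight, max_weight)
```

With healthy metrics the weight climbs 10 → 20 → 30 → 40 → 50 and the canary is promoted; if metrics degrade at some weight, the weight stops advancing and the release rolls back once the threshold is exhausted.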
## Installing Flagger

### Option 1: Helm install (recommended)

#### On Istio

```bash
# Add the Flagger Helm repository
helm repo add flagger https://flagger.app
helm repo update

# Install Flagger
helm upgrade -i flagger flagger/flagger \
  --namespace istio-system \
  --set meshProvider=istio \
  --set metricsServer=http://prometheus.istio-system:9090

# Verify
kubectl -n istio-system get pods -l app.kubernetes.io/name=flagger
```

#### On NGINX Ingress

```bash
helm upgrade -i flagger flagger/flagger \
  --namespace ingress-nginx \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090
```

#### On Linkerd

```bash
helm upgrade -i flagger flagger/flagger \
  --namespace linkerd \
  --set meshProvider=linkerd \
  --set metricsServer=http://prometheus.linkerd-viz:9090
```

### Option 2: kubectl install

```bash
# Install the Flagger CRDs
kubectl apply -f https://raw.githubusercontent.com/fluxcd/flagger/main/artifacts/flagger/crd.yaml

# Install Flagger
kubectl apply -f https://raw.githubusercontent.com/fluxcd/flagger/main/artifacts/flagger/deployment.yaml
```

### Installing the Flagger LoadTester

```bash
# The LoadTester runs automated tests during analysis
helm upgrade -i flagger-loadtester flagger/loadtester \
  --namespace test \
  --create-namespace

# Verify
kubectl -n test get pods -l app=flagger-loadtester
```
## Basic Canary Configuration

### 1. Prepare the application Deployment

```yaml
# myapp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nginx:1.20
          ports:
            - name: http
              containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
```

Apply it:

```bash
kubectl create namespace test
kubectl apply -f myapp-deployment.yaml
```
### 2. Create the Canary resource

```yaml
# myapp-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: test
spec:
  # Target Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # Autoscaler reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: myapp
  # Service configuration
  service:
    port: 80
    targetPort: 80
    portDiscovery: true
  # Canary analysis
  analysis:
    # Interval between checks
    interval: 1m
    # Threshold: max number of failed checks before rollback
    threshold: 5
    # Maximum traffic weight routed to the canary
    maxWeight: 50
    # Weight increment per step
    stepWeight: 10
    # Metric checks
    metrics:
      # 1. Request success rate
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      # 2. Request duration
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
      # 3. Custom Prometheus query
      - name: error-rate
        templateRef:
          name: error-rate
          namespace: flagger-system
        thresholdRange:
          max: 1
        interval: 1m
    # Webhook tests
    webhooks:
      # Load test
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 15s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary.test/"
      # Smoke test
      - name: smoke-test
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: |
            curl -s http://myapp-canary.test/ | grep "Welcome" || exit 1
```

Apply it:

```bash
kubectl apply -f myapp-canary.yaml

# Watch the status
kubectl -n test get canary myapp -w
```
### 3. Resources created by Flagger

Flagger automatically creates:

```bash
# 1. Primary Deployment (the stable version)
kubectl get deployment myapp-primary -n test

# 2. Canary Service
kubectl get service myapp-canary -n test

# 3. Primary Service
kubectl get service myapp -n test

# 4. VirtualService (when using Istio)
kubectl get virtualservice myapp -n test
```

Architecture:

```text
myapp (Deployment)
   ↓
Flagger takes over
   ↓
├─ myapp-primary (auto-created, stable version)
└─ myapp (canary version)
   ↓
Services:
├─ myapp (primary Service)
└─ myapp-canary (canary Service)
```
## Triggering a Canary Release

### Update the image

```bash
# Method 1: kubectl set image
kubectl -n test set image deployment/myapp \
  myapp=nginx:1.21

# Method 2: edit the Deployment
kubectl -n test edit deployment myapp
# change image: nginx:1.21

# Method 3: kubectl patch
kubectl -n test patch deployment myapp \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"myapp","image":"nginx:1.21"}]}}}}'
```

### Monitor the rollout

```bash
# Watch in real time
watch kubectl -n test get canary myapp

# Detailed status
kubectl -n test describe canary myapp

# Events
kubectl -n test get events --field-selector involvedObject.name=myapp --sort-by='.lastTimestamp'
```
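For scripting around these watch commands, the JSON form of the object (`kubectl get canary myapp -o json`) can be condensed into one line. The fields `phase`, `canaryWeight`, and `failedChecks` come from the Canary status in the CRD; the `summarize` helper itself is hypothetical:

```python
def summarize(status: dict) -> str:
    """Condense a Canary's .status (from `kubectl ... -o json`) into one line.

    Reads phase, canaryWeight and failedChecks; missing fields fall back
    to neutral defaults. A sketch helper, not part of any Flagger tooling.
    """
    phase = status.get("phase", "Unknown")
    weight = status.get("canaryWeight", 0)
    failed = status.get("failedChecks", 0)
    return f"{phase}: weight={weight}% failedChecks={failed}"
```

Feeding it the parsed `.status` of a mid-rollout canary would yield something like `Progressing: weight=20% failedChecks=0`.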
### Rollout phases

```text
Phase 1: Initialized
   ↓
Phase 2: Progressing
   ├─ 10% traffic (1 minute)
   ├─ checking metrics... ✅
   ├─ 20% traffic (1 minute)
   ├─ checking metrics... ✅
   ├─ 30% traffic
   ├─ ...
   └─ 50% traffic
   ↓
Phase 3: Promoting
   └─ upgrade primary
   ↓
Phase 4: Finalising
   └─ clean up resources
   ↓
Phase 5: Succeeded ✅
```
## Advanced Configuration

### Custom metrics

#### 1. Create a MetricTemplate

```yaml
# error-rate-template.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 - sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)",
          status!~"5.."
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
        }[{{ interval }}]
      )
    )
    * 100
```
#### 2. Reference it in the Canary

```yaml
spec:
  analysis:
    metrics:
      - name: my-error-rate
        templateRef:
          name: error-rate
          namespace: flagger-system
        thresholdRange:
          max: 5   # error rate must not exceed 5%
        interval: 1m
```
### A/B Testing

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    # A/B testing mode: fixed number of iterations instead of weight stepping
    iterations: 10
    threshold: 5
    match:
      # Route by header
      - headers:
          user-agent:
            regex: ".*Chrome.*"
      # Route by cookie
      - headers:
          cookie:
            regex: "^(.*?;)?(type=insider)(;.*)?$"
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```
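The cookie pattern above matches only when `type=insider` appears as a whole cookie entry, anchored by semicolons or string boundaries. A quick check with Python's `re` module (illustrative, using semicolon-joined cookie strings):

```python
import re

# The same regex as in the Canary match rule above
COOKIE_RE = re.compile(r"^(.*?;)?(type=insider)(;.*)?$")

def is_insider(cookie_header: str) -> bool:
    """True if the cookie string contains type=insider as its own entry."""
    return COOKIE_RE.match(cookie_header) is not None
```

Entries embedded in other values (e.g. `type=outsider`) do not match, so only the tagged users are routed to the canary.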
### Blue-Green Deployment

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    # Blue-Green mode: fixed iterations, then a full switch
    iterations: 10
    threshold: 5
    # Mirror traffic to the canary (does not affect users)
    mirror: true
    mirrorWeight: 100
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
    webhooks:
      # Manual approval gate
      - name: manual-gate
        type: confirm-rollout
        url: http://flagger-loadtester.test/gate/approve
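A `confirm-rollout` webhook is polled by Flagger: an HTTP 200 response lets the rollout proceed, anything else holds it. The toggleable gate below is a sketch of that contract only, not the loadtester's implementation (the loadtester also ships fixed endpoints such as `/gate/approve`, which always answers 200):

```python
class ApprovalGate:
    """Minimal stand-in for a manual confirm-rollout gate.

    Flagger polls check(); 200 means proceed, any other status holds
    the rollout. approve()/halt() model an operator opening or closing
    the gate. A sketch of the webhook contract, not real Flagger code.
    """
    def __init__(self):
        self.is_open = False   # gates start closed: rollout waits

    def approve(self):
        self.is_open = True

    def halt(self):
        self.is_open = False

    def check(self) -> int:
        """HTTP status a confirm-rollout poll would receive."""
        return 200 if self.is_open else 403
```

Until an operator calls `approve()`, every poll gets 403 and the canary waits at its current step.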
### Multi-Stage Gating

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  analysis:
    webhooks:
      # Stage 1: pre-rollout checks
      - name: pre-rollout
        type: pre-rollout
        url: http://my-service.test/pre-rollout
        timeout: 30s
        metadata:
          cmd: "check-dependencies.sh"
      # Stage 2: load test
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 15s
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://myapp-canary/"
      # Stage 3: integration tests
      - name: integration-test
        url: http://my-test-runner.test/run
        timeout: 5m
        metadata:
          test-suite: "integration"
      # Stage 4: manual approval (for critical environments)
      - name: manual-approval
        type: confirm-rollout
        url: http://my-approval-service/approve
      # Stage 5: post-rollout validation
      - name: post-rollout
        type: post-rollout
        url: http://my-service.test/post-rollout
        timeout: 30s
```
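Each webhook receives an HTTP POST whose JSON body carries the canary's `name`, `namespace`, current `phase`, and the webhook's `metadata` map; a non-2xx response fails the check. A minimal handler sketch for the payload side of that contract (the validation itself is a hypothetical placeholder for something like `check-dependencies.sh`):

```python
import json

def handle_webhook(body: bytes) -> int:
    """Return the HTTP status a pre-rollout webhook handler would send.

    Parses the Flagger webhook payload (name, namespace, phase, metadata)
    and rejects malformed requests. The real checks for the named canary
    would run where the comment sits; this is a sketch, not a real service.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return 400
    if not {"name", "namespace", "phase"} <= payload.keys():
        return 500
    # ... run pre-rollout checks for payload["name"] in payload["namespace"] ...
    return 200
```

Wired into any small HTTP server, returning 500 here would stop the rollout at stage 1 before any traffic shifts.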
## Notification Integrations

### Slack

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  analysis:
    alerts:
      - name: slack
        severity: info
        providerRef:
          name: slack
          namespace: flagger-system
```

Create the Slack AlertProvider:

```yaml
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: slack
  namespace: flagger-system
spec:
  type: slack
  channel: deployments
  username: flagger
  address: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
```

### Microsoft Teams

```yaml
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: teams
  namespace: flagger-system
spec:
  type: msteams
  address: https://outlook.office.com/webhook/YOUR/TEAMS/WEBHOOK
```

### Discord

```yaml
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: discord
  namespace: flagger-system
spec:
  type: discord
  username: flagger
  channel: "1234567890"
  address: https://discord.com/api/webhooks/YOUR/DISCORD/WEBHOOK
```
## Real-World Scenarios

### Scenario 1: Production with Istio

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  # HPA autoscaling
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: myapp
  service:
    port: 80
    gateways:
      - istio-system/public-gateway
    hosts:
      - myapp.example.com
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99.5
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 1000
        interval: 1m
      - name: istio_requests_total
        templateRef:
          name: istio-requests
          namespace: istio-system
        thresholdRange:
          min: 10
        interval: 1m
    webhooks:
      # Load test
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 1m -q 10 -c 2 https://myapp.example.com/"
    # Slack alerts
    alerts:
      - name: slack
        severity: error
        providerRef:
          name: slack
          namespace: flagger-system
```
### Scenario 2: Canary Driven by Business Metrics

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  analysis:
    metrics:
      # Technical metric
      - name: request-success-rate
        thresholdRange:
          min: 99
      # Business metric 1: payment success rate
      - name: payment-success-rate
        templateRef:
          name: payment-success
        thresholdRange:
          min: 98
        interval: 1m
      # Business metric 2: average transaction amount
      - name: avg-transaction-amount
        templateRef:
          name: transaction-amount
        thresholdRange:
          min: 50   # no lower than $50
        interval: 2m
```
The corresponding MetricTemplate:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: payment-success
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    sum(rate(payments_total{status="success"}[{{ interval }}]))
    /
    sum(rate(payments_total[{{ interval }}]))
    * 100
```
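The arithmetic behind that PromQL is just the ratio of two `rate()` sums, expressed as a percentage. A mirror of it in plain code, to make the threshold comparison concrete (the zero-traffic fallback is this sketch's choice, not Flagger's):

```python
def payment_success_rate(success_per_s: float, total_per_s: float) -> float:
    """Mirror of the PromQL above: successful payment rate divided by
    total payment rate, as a percentage. Inputs are the two rate() sums."""
    if total_per_s == 0:
        return 100.0   # no traffic: treat as healthy (a sketch's choice)
    return success_per_s / total_per_s * 100
```

With 98 successful payments/s out of 100/s the metric reads 98%, which sits exactly at the `min: 98` threshold; one more failure per second and the analysis would count a failed check.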
## Troubleshooting

### Issue 1: Canary stuck in Progressing

```bash
# Detailed status
kubectl -n test describe canary myapp

# Flagger logs
kubectl -n istio-system logs deploy/flagger -f

# Common causes:
# 1. Metric queries failing
# 2. Wrong Prometheus address
# 3. Thresholds set too strictly
```

### Issue 2: Automatic rollback

```bash
# Find the rollback reason
kubectl -n test describe canary myapp | grep -A 10 "Status"

# Check events
kubectl -n test get events --field-selector involvedObject.name=myapp

# Inspect the status conditions
kubectl -n test get canary myapp -o jsonpath='{.status.conditions}' | jq .
```

### Issue 3: LoadTester timeouts

```bash
# LoadTester logs
kubectl -n test logs deploy/flagger-loadtester

# Run the test manually
kubectl -n test exec -it deploy/flagger-loadtester -- \
  hey -z 30s -c 2 http://myapp-canary/
```
## Summary

### Flagger Strengths

- ✅ Fully automated: no manual intervention required
- ✅ Metrics-driven: decisions based on real data
- ✅ Progressive delivery: risk stays controlled
- ✅ Multiple providers: Istio, Linkerd, NGINX, and more
- ✅ Extensible: custom metrics and webhooks

### Best Practices

- Start with small traffic steps: `stepWeight: 5` or smaller
- Allow enough observation time: `interval: 1m` at minimum
- Set reasonable thresholds: not overly strict
- Integrate alerting: keep the team informed
- Load-test: make sure there is enough traffic to validate the metrics
- Business metrics: combine technical and business indicators

### When to Use Flagger

| Scenario | Flagger recommended? |
|---|---|
| Microservice architecture | ✅ Strongly recommended |
| Service mesh in use | ✅ Best choice |
| Automation required | ✅ Perfect fit |
| Small team | ⚠️ Learning curve |
| Monolithic application | ❌ Overkill |

The next section compares the various canary release solutions to help you choose the right tool.