Automated Canary Releases with Flagger

Introduction to Flagger

Flagger is a progressive delivery tool open-sourced by Weaveworks that fully automates canary releases.

Core Features

Automated workflow: shifts traffic, monitors metrics, and decides on rollback automatically
Multiple providers: Istio, Linkerd, App Mesh, NGINX, Contour
Metrics-driven: promotion decisions are based on Prometheus metrics
Notification integrations: Slack, Teams, Discord alerts
Load testing: built-in load-testing capability

Workflow

1. Detect a new version
   ↓
2. Initialize the canary (create canary Pods)
   ↓
3. Run pre-rollout tests (smoke tests)
   ↓
4. Start shifting traffic
   ├─ 10% → check metrics → pass
   ├─ 20% → check metrics → pass
   ├─ 30% → check metrics → pass
   ├─ ...  → a failed check at any step → automatic rollback ❌
   └─ 100% → promotion complete ✅
   ↓
5. Clean up canary resources

Installing Flagger

Option 1: Helm install (recommended)

Install for Istio

# Add the Flagger Helm repository
helm repo add flagger https://flagger.app
helm repo update

# Install Flagger
helm upgrade -i flagger flagger/flagger \
  --namespace istio-system \
  --set meshProvider=istio \
  --set metricsServer=http://prometheus.istio-system:9090

# Verify
kubectl -n istio-system get pods -l app.kubernetes.io/name=flagger

Install for NGINX Ingress

helm upgrade -i flagger flagger/flagger \
  --namespace ingress-nginx \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus-kube-prometheus-prometheus.monitoring:9090

Install for Linkerd

helm upgrade -i flagger flagger/flagger \
  --namespace linkerd \
  --set meshProvider=linkerd \
  --set metricsServer=http://prometheus.linkerd-viz:9090

Option 2: kubectl install

# Install the Flagger CRDs
kubectl apply -f https://raw.githubusercontent.com/fluxcd/flagger/main/artifacts/flagger/crd.yaml

# Install Flagger
kubectl apply -f https://raw.githubusercontent.com/fluxcd/flagger/main/artifacts/flagger/deployment.yaml

Installing the Flagger LoadTester

# The LoadTester runs automated tests against the canary
helm upgrade -i flagger-loadtester flagger/loadtester \
  --namespace test \
  --create-namespace

# Verify
kubectl -n test get pods -l app=flagger-loadtester

Basic Canary Configuration

1. Prepare the application Deployment

# myapp-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: nginx:1.20
        ports:
        - name: http
          containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi

Apply:

kubectl create namespace test
kubectl apply -f myapp-deployment.yaml

2. Create the Canary resource

# myapp-canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: test
spec:
  # Target Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  
  # Autoscaler reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: myapp
  
  # Service configuration
  service:
    port: 80
    targetPort: 80
    portDiscovery: true
  
  # Canary analysis
  analysis:
    # Interval between metric checks
    interval: 1m
    # Max number of failed checks before rollback
    threshold: 5
    # Maximum canary traffic weight
    maxWeight: 50
    # Weight increment per step
    stepWeight: 10

    # Metric checks
    metrics:
    # 1. Request success rate
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    
    # 2. Request duration
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    
    # 3. Custom Prometheus query
    - name: error-rate
      templateRef:
        name: error-rate
        namespace: flagger-system
      thresholdRange:
        max: 1
      interval: 1m
    
    # Webhook-driven tests
    webhooks:
    # load test
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 15s
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary.test/"
    
    # smoke test
    - name: smoke-test
      url: http://flagger-loadtester.test/
      timeout: 30s
      metadata:
        type: bash
        cmd: |
          curl -s http://myapp-canary.test/ | grep "Welcome" || exit 1

Apply:

kubectl apply -f myapp-canary.yaml

# Watch the rollout status
kubectl -n test get canary myapp -w
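
The analysis settings above fully determine the ramp-up timeline: with interval: 1m, stepWeight: 10, and maxWeight: 50, the canary gains 10% of traffic per minute until it holds 50%. A quick sketch of that arithmetic (plain shell, no cluster needed):

```shell
# Replay the traffic ramp implied by interval=1m, stepWeight=10, maxWeight=50.
step=10; max=50; interval_min=1
weight=0; minutes=0
while [ "$weight" -lt "$max" ]; do
  weight=$((weight + step))
  minutes=$((minutes + interval_min))
  echo "t=${minutes}m  canary weight=${weight}%"
done
echo "analysis runs for at least ${minutes} minutes before promotion"
```

Flagger also accepts an explicit stepWeights list when a non-linear ramp is needed.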

3. Resources Flagger creates automatically

Flagger automatically creates:

# 1. Primary Deployment (the stable version)
kubectl get deployment myapp-primary -n test

# 2. Canary Service
kubectl get service myapp-canary -n test

# 3. Primary Service
kubectl get service myapp -n test

# 4. VirtualService (when using Istio)
kubectl get virtualservice myapp -n test

Architecture:

myapp (Deployment)
   ↓
Flagger takes over
   ↓
├─ myapp-primary (created automatically, serves stable traffic)
└─ myapp (acts as the canary version)
   ↓
Services:
├─ myapp (primary Service)
└─ myapp-canary (canary Service)

Triggering a Canary Release

Update the image

# Method 1: kubectl set image
kubectl -n test set image deployment/myapp \
  myapp=nginx:1.21

# Method 2: edit the Deployment
kubectl -n test edit deployment myapp
# change to image: nginx:1.21

# Method 3: kubectl patch
kubectl -n test patch deployment myapp \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"myapp","image":"nginx:1.21"}]}}}}'

Monitor the release

# Watch in real time
watch kubectl -n test get canary myapp

# Inspect details
kubectl -n test describe canary myapp

# View events
kubectl -n test get events --field-selector involvedObject.name=myapp --sort-by='.lastTimestamp'

Release phases

Phase 1: Initialized
↓
Phase 2: Progressing
  ├─ 10% traffic (1 minute)
  ├─ checking metrics... ✅
  ├─ 20% traffic (1 minute)
  ├─ checking metrics... ✅
  ├─ 30% traffic
  ├─ ...
  └─ 50% traffic
↓
Phase 3: Promoting
  └─ promote the canary spec to primary
↓
Phase 4: Finalising
  └─ clean up resources
↓
Phase 5: Succeeded ✅

Advanced Configuration

Custom metrics

1. Create a MetricTemplate

# error-rate-template.yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 - sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)",
          status!~"5.."
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          pod=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
        }[{{ interval }}]
      )
    )
    * 100
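
The template divides the rate of non-5xx requests by the rate of all requests and subtracts the percentage from 100, yielding an error rate. The arithmetic can be sanity-checked locally with sample rates (hypothetical numbers):

```shell
# 990 of 1000 req/s are non-5xx → the error rate should come out as 1%.
non5xx=990; total=1000
err=$(( 100 - non5xx * 100 / total ))
echo "error rate: ${err}%"
```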

2. Reference it from the Canary

spec:
  analysis:
    metrics:
    - name: my-error-rate
      templateRef:
        name: error-rate
        namespace: flagger-system
      thresholdRange:
        max: 5  # error rate must stay below 5%
      interval: 1m

A/B Testing

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    # A/B testing mode
    iterations: 10
    threshold: 5
    match:
    # route by header
    - headers:
        user-agent:
          regex: ".*Chrome.*"
    # route by cookie
    - headers:
        cookie:
          regex: "^(.*?;)?(type=insider)(;.*)?$"
    
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
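
The cookie rule above can be exercised locally before relying on it for routing. The lazy `(.*?;)?` group is RE2/PCRE syntax; the sketch below uses a greedy POSIX-ERE approximation so plain grep can evaluate it (the results are the same for these inputs):

```shell
# Decide primary vs canary the way the cookie match rule would (ERE approximation).
regex='^(.*;)?(type=insider)(;.*)?$'
route() { printf '%s' "$1" | grep -Eq "$regex" && echo canary || echo primary; }
route "type=insider"              # canary
route "session=abc;type=insider"  # canary
route "session=abc"               # primary
```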

Blue-Green Deployment

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    # Blue-Green mode
    iterations: 10
    threshold: 5
    # mirror traffic to the canary so users are unaffected
    mirror: true
    mirrorWeight: 100
    
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    
    webhooks:
    # manual approval
    - name: manual-gate
      type: confirm-rollout
      url: http://flagger-loadtester.test/gate/approve

Multi-Stage Gates

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  analysis:
    webhooks:
    # stage 1: pre-rollout checks
    - name: pre-rollout
      type: pre-rollout
      url: http://my-service.test/pre-rollout
      timeout: 30s
      metadata:
        cmd: "check-dependencies.sh"

    # stage 2: load test
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 15s
      metadata:
        cmd: "hey -z 2m -q 10 -c 2 http://myapp-canary/"
    
    # stage 3: integration tests
    - name: integration-test
      url: http://my-test-runner.test/run
      timeout: 5m
      metadata:
        test-suite: "integration"
    
    # stage 4: manual approval (for critical environments)
    - name: manual-approval
      type: confirm-rollout
      url: http://my-approval-service/approve
    
    # stage 5: post-rollout verification
    - name: post-rollout
      type: post-rollout
      url: http://my-service.test/post-rollout
      timeout: 30s
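
Flagger drives every webhook the same way: it POSTs a JSON payload to the URL and treats any 2xx response as a pass and anything else as a failure (for confirm-rollout, a non-2xx answer keeps the rollout paused). That decision rule can be sketched as a tiny helper (hypothetical helper name and statuses):

```shell
# Interpret an HTTP status code the way Flagger interprets webhook responses.
gate() { [ "$1" -ge 200 ] && [ "$1" -lt 300 ] && echo pass || echo fail; }
gate 200   # pass → the rollout proceeds
gate 403   # fail → confirm-rollout keeps the canary paused
```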

Notification Integrations

Slack notifications

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  analysis:
    alerts:
    - name: slack
      severity: info
      providerRef:
        name: slack
        namespace: flagger-system

Create the Slack AlertProvider:

apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: slack
  namespace: flagger-system
spec:
  type: slack
  channel: deployments
  username: flagger
  address: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK

Microsoft Teams notifications

apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: teams
  namespace: flagger-system
spec:
  type: msteams
  address: https://outlook.office.com/webhook/YOUR/TEAMS/WEBHOOK

Discord notifications

apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: discord
  namespace: flagger-system
spec:
  type: discord
  username: flagger
  channel: "1234567890"
  address: https://discord.com/api/webhooks/YOUR/DISCORD/WEBHOOK

Practical Scenarios

Scenario 1: Production with Istio

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  
  # HPA autoscaling
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: myapp
  
  service:
    port: 80
    gateways:
    - istio-system/public-gateway
    hosts:
    - myapp.example.com
    trafficPolicy:
      tls:
        mode: ISTIO_MUTUAL
  
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99.5
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 1000
      interval: 1m
    - name: istio_requests_total
      templateRef:
        name: istio-requests
        namespace: istio-system
      thresholdRange:
        min: 10
      interval: 1m
    
    webhooks:
    # load test
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 -c 2 https://myapp.example.com/"
    
    # Slack alerts
    alerts:
    - name: slack
      severity: error
      providerRef:
        name: slack
        namespace: flagger-system

Scenario 2: Canary driven by business metrics

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  analysis:
    metrics:
    # technical metric
    - name: request-success-rate
      thresholdRange:
        min: 99

    # business metric 1: payment success rate
    - name: payment-success-rate
      templateRef:
        name: payment-success
      thresholdRange:
        min: 98
      interval: 1m

    # business metric 2: average transaction amount
    - name: avg-transaction-amount
      templateRef:
        name: transaction-amount
      thresholdRange:
        min: 50  # must not drop below $50
      interval: 2m

MetricTemplate:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: payment-success
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    sum(rate(payments_total{status="success"}[{{ interval }}]))
    /
    sum(rate(payments_total[{{ interval }}]))
    * 100
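
As with the error-rate template, this is a ratio of counter rates. Checking the arithmetic against the min: 98 threshold with sample numbers (hypothetical rates):

```shell
# 49 of 50 payments/s succeed → 98%, which just meets the min: 98 threshold.
success_rps=49; total_rps=50
pct=$(( success_rps * 100 / total_rps ))
[ "$pct" -ge 98 ] && echo "check passed (${pct}%)" || echo "check failed (${pct}%)"
```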

Troubleshooting

Problem 1: The canary stays in Progressing

# Inspect the detailed status
kubectl -n test describe canary myapp

# Check the Flagger logs
kubectl -n istio-system logs deploy/flagger -f

# Common causes:
# 1. The metric query fails
# 2. Wrong Prometheus address
# 3. Thresholds set too strictly

Problem 2: Automatic rollback

# Find the rollback reason
kubectl -n test describe canary myapp | grep -A 10 "Status"

# View events
kubectl -n test get events --field-selector involvedObject.name=myapp

# Inspect the status conditions
kubectl -n test get canary myapp -o jsonpath='{.status.conditions}' | jq .
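
The rollback condition itself is simple: Flagger rolls the canary back once the number of failed checks reaches the analysis threshold. In shell terms (hypothetical helper):

```shell
# Flagger's rollback rule: failedChecks >= threshold → roll back.
decide() { [ "$1" -ge "$2" ] && echo rollback || echo continue; }
decide 5 5   # rollback (threshold reached)
decide 2 5   # continue
```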

Problem 3: LoadTester timeouts

# Check the LoadTester logs
kubectl -n test logs deploy/flagger-loadtester

# Run the test manually
kubectl -n test exec -it deploy/flagger-loadtester -- \
  hey -z 30s -c 2 http://myapp-canary/

Summary

Flagger's strengths

Fully automated: no manual intervention required
Metrics-driven: decisions are based on real data
Progressive delivery: risk stays controlled
Multiple providers: Istio, Linkerd, NGINX, and more
Extensible: custom metrics and webhooks

Best practices

  1. Start with a small traffic step: stepWeight: 5 or lower
  2. Allow enough observation time: an interval of at least 1m
  3. Set reasonable thresholds: not overly strict
  4. Integrate alerting: notify the team promptly
  5. Load test: make sure there is enough traffic to validate the metrics
  6. Business metrics: combine technical and business indicators

When to use Flagger

Scenario               Recommendation
Microservices          ✅ strongly recommended
Service mesh in use    ✅ best fit
Automation required    ✅ perfect match
Small team             ⚠️ learning curve
Monolith               ❌ overkill

The next section compares the various canary release solutions to help you choose the right tool.