Blue-Green Deployment Automation Tools

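This section surveys tools and platforms for automating blue-green deployments.

Argo Rollouts in Depth

Argo Rollouts is a progressive delivery controller built specifically for Kubernetes.

A Complete Rollout Configuration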

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 5
  revisionHistoryLimit: 3
  
  selector:
    matchLabels:
      app: myapp
  
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:v1.0
        ports:
        - containerPort: 8080
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
  
  strategy:
    blueGreen:
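      # Active Service (production traffic)
      activeService: myapp-active

      # Preview Service (test traffic)
      previewService: myapp-preview

      # Promotion: pause before switching traffic; with autoPromotionSeconds set,
      # the Rollout promotes itself after 30 s in the paused state
      # (omit autoPromotionSeconds to require a manual promote)
      autoPromotionEnabled: false
      autoPromotionSeconds: 30

      # Scale-down settings
      scaleDownDelaySeconds: 300  # scale down the old version after 5 minutes
      scaleDownDelayRevisionLimit: 2  # keep at most 2 old revisions running

      # Anti-affinity (keeps blue and green pods on different nodes)
      antiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution: {}

      # Pre-promotion analysis (runs against the preview Service)
      prePromotionAnalysis:
        templates:
        - templateName: smoke-test
        args:
        - name: service-name
          value: myapp-preview

      # Post-promotion analysis (runs against the active Service)
      postPromotionAnalysis:
        templates:
        - templateName: load-test
        - templateName: error-rate-check
        args:
        - name: service-name
          value: myapp-active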
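
The Rollout above refers to two Services by name, myapp-active and myapp-preview, and both have to exist before the Rollout is applied. The following is a minimal sketch of how they might be rendered. The helper name render_service and the 80 to 8080 port mapping are assumptions, not part of the original configuration.

```shell
# Hypothetical helper: print a Service manifest for the given name.
# Assumes the Service listens on port 80 and forwards to containerPort 8080.
render_service() {
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: $1
  namespace: ${2:-production}
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080
EOF
}

# Usage (commented out because it needs cluster access):
# render_service myapp-active  | kubectl apply -f -
# render_service myapp-preview | kubectl apply -f -
```

Both Services start with the same selector. Argo Rollouts adds a pod-template-hash selector to each, so they end up pointing at different ReplicaSets.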

Analysis Template

1. Smoke test:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: smoke-test
spec:
  args:
  - name: service-name
  metrics:
  - name: smoke-test
    initialDelay: 10s
    interval: 30s
    count: 3
    # A job metric passes when the Job exits 0, so no successCondition is needed
    provider:
      job:
        spec:
          template:
            spec:
              containers:
              - name: smoke-test
                image: curlimages/curl:latest
                command:
                - /bin/sh
                - -c
                - |
                  # Health check
                  curl -sf http://{{args.service-name}}/health || exit 1

                  # API test
                  curl -sf http://{{args.service-name}}/api/test || exit 1

                  # Login test (the curl image ships without jq, so the token is extracted with sed)
                  TOKEN=$(curl -sf -X POST http://{{args.service-name}}/api/login \
                    -H 'Content-Type: application/json' \
                    -d '{"username":"test","password":"test"}' \
                    | sed -n 's/.*"token" *: *"\([^"]*\)".*/\1/p')

                  if [ -z "$TOKEN" ]; then
                    echo "Login failed"
                    exit 1
                  fi

                  echo "success"
              restartPolicy: Never
          backoffLimit: 1

2. Error-rate check:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    initialDelay: 30s
    interval: 60s
    count: 5
    successCondition: result[0] < 0.05  # error rate below 5% (the Prometheus result is a vector)
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090
        query: |
          sum(rate(http_requests_total{
            service="{{args.service-name}}",
            status=~"5.."
          }[5m]))
          /
          sum(rate(http_requests_total{
            service="{{args.service-name}}"
          }[5m]))

3. Load test:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: load-test
spec:
  args:
  - name: service-name
  metrics:
  - name: load-test
    initialDelay: 60s
    interval: 120s
    count: 3
    # The Job fails when k6 exits non-zero, which happens when the
    # P95 < 500 ms threshold below is breached; no successCondition is needed
    provider:
      job:
        spec:
          template:
            spec:
              containers:
              - name: k6-load-test
                image: grafana/k6:latest
                # `stdin` on a container is a boolean, so the script is fed to k6
                # through a heredoc (assumes the image provides /bin/sh)
                command:
                - /bin/sh
                - -c
                - |
                  cat <<'EOF' | k6 run -
                  import http from 'k6/http';
                  import { check } from 'k6';

                  export const options = {
                    stages: [
                      { duration: '30s', target: 50 },
                      { duration: '1m', target: 100 },
                      { duration: '30s', target: 0 },
                    ],
                    thresholds: {
                      http_req_duration: ['p(95)<500'],
                    },
                  };

                  export default function () {
                    const res = http.get('http://{{args.service-name}}/api/test');
                    check(res, {
                      'status is 200': (r) => r.status === 200,
                    });
                  }
                  EOF
              restartPolicy: Never
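
The load test hinges on the P95 latency staying under 500 ms. As an illustration of what that number means, the sketch below computes a nearest-rank P95 from a list of latencies. The helper name p95 is hypothetical, and nearest-rank is only one common definition, so k6's own calculation may differ slightly.

```shell
# Hypothetical helper: nearest-rank P95 of latencies in ms, one value per line on stdin.
p95() {
  sort -n | awk '{ v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * N) using integer-safe arithmetic
      print v[rank]
    }'
}

# Usage:
# printf '%s\n' 120 480 90 510 300 | p95
```

With only five samples the 95th percentile is the slowest request, which is why short test runs tend to produce noisy P95 values.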

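Using the Argo Rollouts kubectl Plugin

# Watch the Rollout status
kubectl argo rollouts get rollout myapp -w

# Promote (switch traffic to the new version)
kubectl argo rollouts promote myapp

# Abort the rollout
kubectl argo rollouts abort myapp

# Retry an aborted rollout
kubectl argo rollouts retry rollout myapp

# Restart the pods
kubectl argo rollouts restart myapp

# View revisions (the plugin has no history subcommand; get rollout lists them)
kubectl argo rollouts get rollout myapp

# Roll back to a specific revision
kubectl argo rollouts undo myapp --to-revision=2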
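
The promote, status and abort commands can be chained into a single guarded step. The following is a minimal sketch under stated assumptions: the helper name promote_and_wait, the production namespace default and the 300s timeout are illustrative, and the KUBECTL variable exists only so the function can be dry-run without a cluster.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# Hypothetical helper: promote a Rollout, wait for it to become healthy,
# and abort it if that does not happen within the timeout.
promote_and_wait() {
  name=$1; namespace=${2:-production}; timeout=${3:-300s}
  $KUBECTL argo rollouts promote "$name" -n "$namespace" || return 1
  if ! $KUBECTL argo rollouts status "$name" -n "$namespace" --timeout "$timeout"; then
    $KUBECTL argo rollouts abort "$name" -n "$namespace"
    return 1
  fi
}

# Usage (commented out because it needs cluster access):
# promote_and_wait myapp production 300s
```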

Argo Rollouts Dashboard

# Install the Dashboard
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/dashboard-install.yaml

# Forward the port
kubectl port-forward -n argo-rollouts svc/argo-rollouts-dashboard 3100:3100

# Open http://localhost:3100

Blue-Green Deployments with Flagger

Flagger is another popular progressive delivery tool.

Installing Flagger

# Add the Helm repository
helm repo add flagger https://flagger.app

# Install Flagger (NGINX Ingress)
helm upgrade -i flagger flagger/flagger \
  --namespace ingress-nginx \
  --set meshProvider=nginx \
  --set metricsServer=http://prometheus.monitoring:9090

# Install the Grafana dashboard
helm upgrade -i flagger-grafana flagger/grafana \
  --namespace ingress-nginx \
  --set url=http://prometheus.monitoring:9090
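Flagger Canary Resource (Blue-Green Mode)

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  # Deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp

  # Autoscaler reference
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: myapp

  # Service configuration
  service:
    port: 80
    targetPort: 8080
    portDiscovery: true

  # Blue-green analysis: setting iterations without stepWeight/maxWeight
  # makes Flagger run the checks a fixed number of times and then switch traffic
  analysis:
    interval: 1m
    threshold: 5
    iterations: 10

    # Cookie-based session affinity; Flagger documents this for weighted
    # canary releases, so it may have no effect in blue-green mode
    sessionAffinity:
      cookieName: flagger-cookie
      maxAge: 21600

    # Metrics
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m

    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m

    # Webhooks
    webhooks:
    # Acceptance test: runs once before the analysis starts
    - name: acceptance-test
      type: pre-rollout
      url: http://flagger-loadtester.production/
      timeout: 10s
      metadata:
        type: bash
        cmd: "curl -sd 'anon' http://myapp-canary/token | jq ."

    # Load test: runs during each analysis iteration to generate traffic
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.production/
      timeout: 5s
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"

    # /gate/approve always approves; point this at /gate/check to gate
    # promotion manually through /gate/open and /gate/close
    - name: promotion-gate
      type: confirm-promotion
      url: http://flagger-loadtester.production/gate/approve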
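
The interval, iterations and threshold values above determine how long a release takes. As a rough guide, assuming Flagger's documented blue-green behavior, the analysis runs for at least interval × iterations and a failing release is rolled back after interval × threshold. The helper name bluegreen_timing below is hypothetical.

```shell
# Hypothetical helper: timing for a Flagger blue-green analysis, in minutes.
#   minimum analysis time = interval * iterations
#   time to rollback      = interval * threshold
bluegreen_timing() {
  interval_min=$1; iterations=$2; threshold=$3
  echo "analysis=$((interval_min * iterations))m rollback=$((interval_min * threshold))m"
}

bluegreen_timing 1 10 5   # values from the Canary above; prints analysis=10m rollback=5m
```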

Flagger Loadtester

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flagger-loadtester
  namespace: production
spec:
  selector:
    matchLabels:
      app: flagger-loadtester
  template:
    metadata:
      labels:
        app: flagger-loadtester
    spec:
      containers:
      - name: loadtester
        image: ghcr.io/fluxcd/flagger-loadtester:latest
        ports:
        - containerPort: 8080
        command:
        - ./loadtester
        - -port=8080
        - -log-level=info
        - -timeout=1h

---
apiVersion: v1
kind: Service
metadata:
  name: flagger-loadtester
  namespace: production
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: flagger-loadtester

Blue-Green Deployments with Spinnaker

Spinnaker is a continuous delivery platform open-sourced by Netflix.

Spinnaker Pipeline Configuration

{
  "application": "myapp",
  "name": "Blue-Green Deployment",
  "stages": [
    {
      "type": "deployManifest",
      "name": "Deploy Green",
      "account": "my-k8s-account",
      "cloudProvider": "kubernetes",
      "manifests": [
        {
          "apiVersion": "apps/v1",
          "kind": "Deployment",
          "metadata": {
            "name": "myapp-green"
          },
          "spec": {
            "replicas": 3,
            "selector": {
              "matchLabels": {
                "app": "myapp",
                "version": "green"
              }
            },
            "template": {
              "metadata": {
                "labels": {
                  "app": "myapp",
                  "version": "green"
                }
              },
              "spec": {
                "containers": [
                  {
                    "name": "myapp",
                    "image": "${ parameters.image }",
                    "ports": [
                      {
                        "containerPort": 8080
                      }
                    ]
                  }
                ]
              }
            }
          }
        }
      ],
      "source": "text"
    },
    {
      "type": "manualJudgment",
      "name": "Manual Approval",
      "instructions": "Review green environment before switching traffic",
      "judgmentInputs": [
        {
          "value": "approve"
        },
        {
          "value": "reject"
        }
      ]
    },
    {
      "type": "patchManifest",
      "name": "Switch Traffic to Green",
      "account": "my-k8s-account",
      "cloudProvider": "kubernetes",
      "manifestName": "service myapp-service",
      "options": {
        "mergeStrategy": "json"
      },
      "patchBody": [
        {
          "op": "replace",
          "path": "/spec/selector/version",
          "value": "green"
        }
      ]
    },
    {
      "type": "wait",
      "name": "Monitor Green",
      "waitTime": 300
    },
    {
      "type": "deleteManifest",
      "name": "Delete Blue",
      "account": "my-k8s-account",
      "cloudProvider": "kubernetes",
      "manifestName": "deployment myapp-blue"
    }
  ],
  "triggers": [
    {
      "type": "webhook",
      "source": "github",
      "enabled": true
    }
  ],
  "parameterConfig": [
    {
      "name": "image",
      "default": "myregistry/myapp:latest",
      "description": "Docker image to deploy"
    }
  ]
}
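
The traffic switch in the pipeline above is a JSON Patch against the Service selector. To try the same patch outside Spinnaker, it can be built and applied with kubectl. This is a sketch only: the helper name selector_json_patch is hypothetical, and restricting the value to blue or green is an added safety assumption.

```shell
# Hypothetical helper: build the JSON Patch that points the Service selector at a color.
selector_json_patch() {
  case $1 in
    blue|green) ;;
    *) echo "unknown color: $1" >&2; return 1 ;;
  esac
  printf '[{"op":"replace","path":"/spec/selector/version","value":"%s"}]\n' "$1"
}

# Usage (commented out because it needs cluster access):
# kubectl patch service myapp-service --type=json -p "$(selector_json_patch green)"
```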

Blue-Green Deployments with GitLab CI/CD

.gitlab-ci.yml

stages:
  - build
  - deploy-green
  - test-green
  - switch-traffic
  - cleanup

variables:
  KUBE_NAMESPACE: production
  APP_NAME: myapp

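build:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    # Log in to the GitLab container registry before pushing
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  only:
    - main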

deploy-green:
  stage: deploy-green
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]  # the image's default entrypoint is kubectl, which would swallow the job script
  script:
    # Determine the color currently receiving traffic (defaults to blue)
    - CURRENT_COLOR=$(kubectl get service $APP_NAME-service -n $KUBE_NAMESPACE -o jsonpath='{.spec.selector.version}' || echo "blue")
    - CURRENT_COLOR=${CURRENT_COLOR:-blue}
    - |
      if [ "$CURRENT_COLOR" = "blue" ]; then
        NEW_COLOR="green"
      else
        NEW_COLOR="blue"
      fi
    - echo "Current=$CURRENT_COLOR, New=$NEW_COLOR"
    - echo $NEW_COLOR > .color

    # Deploy the new color
    - kubectl set image deployment/$APP_NAME-$NEW_COLOR $APP_NAME=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA -n $KUBE_NAMESPACE
    - kubectl rollout status deployment/$APP_NAME-$NEW_COLOR -n $KUBE_NAMESPACE --timeout=5m
  artifacts:
    paths:
      - .color
  only:
    - main

test-green:
  stage: test-green
  image: curlimages/curl:latest
  script:
    - NEW_COLOR=$(cat .color)
    - echo "Testing $NEW_COLOR environment"

    # Smoke tests
    - curl -f http://$APP_NAME-$NEW_COLOR-test.$KUBE_NAMESPACE.svc.cluster.local/health || exit 1
    - curl -f http://$APP_NAME-$NEW_COLOR-test.$KUBE_NAMESPACE.svc.cluster.local/api/test || exit 1

    # Basic load check: 100 requests, counting non-200 responses
    # (seq is used because the image's shell does not expand {1..100})
    - |
      for i in $(seq 1 100); do
        curl -s -o /dev/null -w "%{http_code}\n" http://$APP_NAME-$NEW_COLOR-test.$KUBE_NAMESPACE.svc.cluster.local/
      done | grep -v 200 | wc -l > errors.txt

    - |
      if [ $(cat errors.txt) -gt 5 ]; then
        echo "Too many errors"
        exit 1
      fi
  dependencies:
    - deploy-green
  only:
    - main

switch-traffic:
  stage: switch-traffic
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]
  script:
    - NEW_COLOR=$(cat .color)
    - echo "Switching traffic to $NEW_COLOR"
    - kubectl patch service $APP_NAME-service -n $KUBE_NAMESPACE -p "{\"spec\":{\"selector\":{\"version\":\"$NEW_COLOR\"}}}"

    # Verify
    - sleep 10
    - kubectl get service $APP_NAME-service -n $KUBE_NAMESPACE -o yaml
  # The .color artifact is produced by deploy-green, not test-green
  dependencies:
    - deploy-green
  when: manual
  only:
    - main

cleanup:
  stage: cleanup
  image:
    name: bitnami/kubectl:latest
    entrypoint: [""]
  script:
    - NEW_COLOR=$(cat .color)
    - |
      if [ "$NEW_COLOR" = "blue" ]; then
        OLD_COLOR="green"
      else
        OLD_COLOR="blue"
      fi
    - echo "Cleaning up $OLD_COLOR environment"
    - kubectl delete service $APP_NAME-$NEW_COLOR-test -n $KUBE_NAMESPACE || true

    # Keep the old version for 7 days by triggering this manual job only after
    # that period; a 7-day sleep inside the job would exceed the job timeout
    - kubectl delete deployment $APP_NAME-$OLD_COLOR -n $KUBE_NAMESPACE
  # The .color artifact is produced by deploy-green
  dependencies:
    - deploy-green
  when: manual
  only:
    - main
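
Two pieces of logic in this pipeline are easy to get wrong in a minimal shell: choosing the idle color and counting failed requests. Here is a sketch of both as POSIX functions that can be tested outside CI. The names next_color and count_non_200 are hypothetical, and treating an empty or unexpected color as "blue is live" is an assumption.

```shell
# Hypothetical helper: given the color currently receiving traffic, print the idle one.
# An empty or unexpected value is treated as blue being live.
next_color() {
  if [ "$1" = "green" ]; then echo "blue"; else echo "green"; fi
}

# Hypothetical helper: count HTTP status codes on stdin (one per line) that are not 200.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_non_200() {
  grep -vc '^200$' || true
}

# Usage:
# next_color blue
# printf '200\n500\n200\n' | count_non_200
```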

Blue-Green Deployments with Jenkins Pipeline

Jenkinsfile

pipeline {
    agent any
    
    parameters {
        string(name: 'IMAGE_TAG', defaultValue: 'latest', description: 'Docker image tag')
        choice(name: 'ENVIRONMENT', choices: ['dev', 'staging', 'production'], description: 'Target environment')
    }
    
    environment {
        APP_NAME = 'myapp'
        KUBE_NAMESPACE = "${params.ENVIRONMENT}"
        REGISTRY = 'myregistry'
    }
    
    stages {
        stage('Build') {
            steps {
                script {
                    docker.build("${REGISTRY}/${APP_NAME}:${params.IMAGE_TAG}")
                    docker.withRegistry('https://myregistry', 'docker-credentials') {
                        docker.image("${REGISTRY}/${APP_NAME}:${params.IMAGE_TAG}").push()
                    }
                }
            }
        }
        
        stage('Determine Color') {
            steps {
                script {
                    def currentColor = sh(
                        script: "kubectl get service ${APP_NAME}-service -n ${KUBE_NAMESPACE} -o jsonpath='{.spec.selector.version}' || echo 'blue'",
                        returnStdout: true
                    ).trim()
                    
                    env.CURRENT_COLOR = currentColor
                    env.NEW_COLOR = currentColor == 'blue' ? 'green' : 'blue'
                    
                    echo "Current: ${env.CURRENT_COLOR}, New: ${env.NEW_COLOR}"
                }
            }
        }
        
        stage('Deploy Green') {
            steps {
                script {
                    sh """
                        kubectl set image deployment/${APP_NAME}-${env.NEW_COLOR} \
                            ${APP_NAME}=${REGISTRY}/${APP_NAME}:${params.IMAGE_TAG} \
                            -n ${KUBE_NAMESPACE}
                        
                        kubectl rollout status deployment/${APP_NAME}-${env.NEW_COLOR} \
                            -n ${KUBE_NAMESPACE} --timeout=5m
                    """
                }
            }
        }
        
        stage('Test Green') {
            steps {
                script {
                    sh """
                        kubectl run test-pod --rm -i --restart=Never \
                            --image=curlimages/curl:latest \
                            -n ${KUBE_NAMESPACE} -- \
                            curl -f http://${APP_NAME}-${env.NEW_COLOR}:80/health
                    """
                }
            }
        }
        
        stage('Approval') {
            when {
                expression { params.ENVIRONMENT == 'production' }
            }
            steps {
                input message: "Switch traffic to ${env.NEW_COLOR}?",
                      ok: 'Deploy',
                      submitter: 'ops-team'
            }
        }
        
        stage('Switch Traffic') {
            steps {
                script {
                    sh """
                        kubectl patch service ${APP_NAME}-service \
                            -n ${KUBE_NAMESPACE} \
                            -p '{"spec":{"selector":{"version":"${env.NEW_COLOR}"}}}'
                    """
                    
                    // Give the new version a minute of traffic before continuing
                    sleep 60
                }
            }
        }
        
        stage('Cleanup') {
            steps {
                script {
                    timeout(time: 7, unit: 'DAYS') {
                        input message: "Delete ${env.CURRENT_COLOR} environment?",
                              ok: 'Delete'
                    }
                    
                    sh "kubectl delete deployment ${APP_NAME}-${env.CURRENT_COLOR} -n ${KUBE_NAMESPACE}"
                }
            }
        }
    }
    
    post {
        failure {
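            script {
                // Roll back only if the previous color was recorded before the failure;
                // otherwise the selector would be patched to "null"
                if (env.CURRENT_COLOR) {
                    sh """
                        kubectl patch service ${APP_NAME}-service \
                            -n ${KUBE_NAMESPACE} \
                            -p '{"spec":{"selector":{"version":"${env.CURRENT_COLOR}"}}}'
                    """
                }
            }

            // Send a notification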
            emailext(
                subject: "Pipeline Failed: ${env.JOB_NAME}",
                body: "Build ${env.BUILD_NUMBER} failed. Please check Jenkins.",
                to: 'ops-team@example.com'
            )
        }
        
        success {
            emailext(
                subject: "Pipeline Success: ${env.JOB_NAME}",
                body: "Deployed ${params.IMAGE_TAG} to ${params.ENVIRONMENT}",
                to: 'ops-team@example.com'
            )
        }
    }
}
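
The rollback in the post section patches the Service selector back to the previous color. The same guard can be expressed as a shell function that an sh step could call. This is a sketch under stated assumptions: the helper name rollback_traffic and its defaults are illustrative, and the KUBECTL variable exists only so the function can be dry-run without a cluster.

```shell
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL=${KUBECTL:-kubectl}

# Hypothetical helper: point the Service back at the previous color,
# refusing to act when that color was never recorded.
rollback_traffic() {
  previous=$1; service=${2:-myapp-service}; namespace=${3:-production}
  case $previous in
    blue|green) ;;
    *) echo "no previous color recorded; skipping rollback" >&2; return 1 ;;
  esac
  $KUBECTL patch service "$service" -n "$namespace" \
    -p "{\"spec\":{\"selector\":{\"version\":\"$previous\"}}}"
}

# Usage (commented out because it needs cluster access):
# rollback_traffic blue myapp-service production
```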

Monitoring and Observability Integration

Prometheus Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: blue-green-alerts
spec:
  groups:
  - name: blue-green-deployment
    interval: 30s
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5..", app="myapp"}[5m]))
        /
        sum(rate(http_requests_total{app="myapp"}[5m]))
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }}"
    
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{app="myapp"}[5m])) by (le)
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "P95 latency is {{ $value }}s"

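Summary

Comparison of automation tools:

Tool          | Strengths                               | Weaknesses                      | Best suited for
Argo Rollouts | Kubernetes-native, feature-rich         | Learning curve                  | Kubernetes workloads in general
Flagger       | Highly automated, supports many meshes  | Depends on Prometheus           | Service mesh environments
Spinnaker     | Comprehensive, multi-cloud support      | Complex, resource-hungry        | Enterprise, multi-cloud
GitLab CI/CD  | Tightly integrated, easy to use         | Limited flexibility             | GitLab users
Jenkins       | Highly customizable, rich ecosystem     | Logic must be scripted by hand  | Traditional enterprises

Recommendations:

  • Kubernetes-native: Argo Rollouts
  • Service mesh: Flagger + Istio
  • Enterprise multi-cloud: Spinnaker
  • Quick start: GitLab CI/CD
  • Heavy customization: Jenkins

With these tools in place, the blue-green release process can be automated end to end.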