Operations and Best Practices

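This section covers day-to-day operational tasks for the microservice application and best practices for running it in production.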

Rolling Updates

Updating the Image Version

# Update a single service
kubectl set image deployment/product-service \
  product-service=ecommerce/product-service:v1.1.0 \
  -n ecommerce

# Watch the rolling update's progress
kubectl rollout status deployment/product-service -n ecommerce

# View the rollout history
kubectl rollout history deployment/product-service -n ecommerce

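Rolling Back to an Earlier Version

# Roll back to the previous revision
kubectl rollout undo deployment/product-service -n ecommerce

# Roll back to a specific revision
kubectl rollout undo deployment/product-service --to-revision=2 -n ecommerce

# Pause / resume a rolling update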
kubectl rollout pause deployment/product-service -n ecommerce
kubectl rollout resume deployment/product-service -n ecommerce
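The update and rollback commands above can be combined into one guarded step. The following is a minimal sketch, not a definitive deployment script: deploy_or_rollback is a hypothetical helper name, the 120s default timeout is an arbitrary choice, and the container is assumed to be named after the Deployment, as in the examples above.

```shell
# Sketch: update an image, wait for the rollout, and undo it on failure.
# deploy_or_rollback is a hypothetical helper, not part of kubectl.
deploy_or_rollback() {
  local deployment="$1" image="$2" namespace="$3" timeout="${4:-120s}"

  # Assumption: the container name matches the Deployment name.
  kubectl set image "deployment/${deployment}" \
    "${deployment}=${image}" -n "${namespace}" || return 1

  # rollout status exits non-zero if the rollout has not finished within --timeout.
  if kubectl rollout status "deployment/${deployment}" \
      -n "${namespace}" --timeout="${timeout}"; then
    echo "rollout of ${image} succeeded"
  else
    echo "rollout of ${image} failed, rolling back" >&2
    kubectl rollout undo "deployment/${deployment}" -n "${namespace}"
    return 1
  fi
}
```

For example: deploy_or_rollback product-service ecommerce/product-service:v1.1.0 ecommerce 180s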

Scaling

Manual Scaling

# Scale out to 5 replicas
kubectl scale deployment product-service --replicas=5 -n ecommerce

# Check the replica count
kubectl get deployment product-service -n ecommerce

HPA Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: product-service-hpa
  namespace: ecommerce
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: product-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
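  # Pods metrics need a custom metrics API (for example, Prometheus Adapter) serving http_requests_per_second
  - type: Pods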
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
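To see how the targets above translate into replica counts, the sketch below applies the scaling formula from the Kubernetes documentation, desired = ceil(current × currentMetric / targetMetric), and clamps the result to minReplicas and maxReplicas. It is a simplified illustration: hpa_desired_replicas is a hypothetical helper, and it ignores the controller's tolerance band and stabilization window. When several metrics are configured, the HPA computes a value per metric and uses the largest.

```shell
# Sketch of the HPA replica calculation for a single metric.
# hpa_desired_replicas is a hypothetical helper for illustration only.
hpa_desired_replicas() {
  local current="$1" metric="$2" target="$3" min="$4" max="$5"

  # Integer ceiling division: ceil(a / b) == (a + b - 1) / b
  local desired=$(( (current * metric + target - 1) / target ))

  # Clamp to the minReplicas / maxReplicas bounds.
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}
```

With the manifest above (CPU target 70, 3 to 20 replicas), running hpa_desired_replicas 3 90 70 3 20 prints 4: three replicas averaging 90% CPU scale out to four.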

Log Management

Viewing Logs

# Stream logs in real time
kubectl logs -f product-service-abc123 -n ecommerce

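# Show the last 100 lines
kubectl logs --tail=100 product-service-abc123 -n ecommerce

# Show logs from a specific container
kubectl logs product-service-abc123 -c product-service -n ecommerce

# Show logs from the previous container instance (after a restart)
kubectl logs product-service-abc123 --previous -n ecommerce

# Show logs from all replicas (last 50 lines per Pod)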
kubectl logs -l app=product-service -n ecommerce --tail=50

Log Aggregation (with stern)

# Install stern (via Homebrew)
brew install stern

# Tail logs from all product-service Pods
stern product-service -n ecommerce

# Use a label selector
stern -l app=product-service -n ecommerce

# Tail across all namespaces
stern product-service --all-namespaces

Monitoring and Alerting

ServiceMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ecommerce-services
  namespace: ecommerce
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      tier: backend
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

PrometheusRule Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ecommerce-alerts
  namespace: ecommerce
spec:
  groups:
  - name: ecommerce.rules
    interval: 30s
    rules:
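    - alert: HighErrorRate
      # Aggregate by app before dividing so the 5xx and total series share the same labels
      expr: |
        sum by (app) (rate(http_requests_total{status_code=~"5.."}[5m]))
        /
        sum by (app) (rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High service error rate"
        description: "{{ $labels.app }} error rate is {{ $value | humanizePercentage }}"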
    
    - alert: PodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total{namespace="ecommerce"}[15m]) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod restarting frequently"
        description: "{{ $labels.pod }} restarted about {{ $value }} times in the last 15 minutes"

Tuning Health Checks

Liveness Probe

livenessProbe:
  httpGet:
    path: /health
    port: 8080
    httpHeaders:
    - name: X-Health-Check
      value: "liveness"
  initialDelaySeconds: 60  # time the application needs to start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3      # restart the container after 3 consecutive failures
  successThreshold: 1

Readiness Probe

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3      # remove the Pod from load balancing after 3 consecutive failures
  successThreshold: 1

Startup Probe (for slow-starting applications)

startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 30     # allow up to 300 seconds for startup (30 x 10s)
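The time budgets implied by the three probes follow from simple arithmetic: roughly initialDelaySeconds + periodSeconds × failureThreshold. The helper below is a hypothetical sketch of that calculation. It is an approximation that ignores timeoutSeconds and the exact timing of the first probe.

```shell
# Sketch: approximate how long a probe tolerates failures before Kubernetes acts.
# probe_window_seconds is a hypothetical helper; the result is an estimate only.
probe_window_seconds() {
  local initial_delay="$1" period="$2" failure_threshold="$3"
  echo $(( initial_delay + period * failure_threshold ))
}
```

With the values above, probe_window_seconds 0 10 30 prints 300 (the startup budget). For a container that is already running, probe_window_seconds 0 10 3 prints 30 (time until a liveness restart) and probe_window_seconds 0 5 3 prints 15 (time until the Pod is removed from load balancing).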

Configuration Management Best Practices

Environment Variable Injection Order

containers:
- name: app
  # 1. Load every key from the ConfigMap
  envFrom:
  - configMapRef:
      name: app-config

  # 2. Explicitly defined variables (these override envFrom when keys collide)
  env:
  - name: APP_ENV
    value: "production"
  # 3. Inject a single value from a Secret
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: db-secret
        key: password

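Updating Configuration

# Update the ConfigMap (generate the manifest client-side, then apply it)
kubectl create configmap app-config \
  --from-file=config.json \
  -n ecommerce \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart the Pods so the new configuration takes effect
# (environment variables injected from a ConfigMap are only read at container start)
kubectl rollout restart deployment/product-service -n ecommerce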

Resource Optimization

Recommended Resource Requests and Limits

resources:
  requests:
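    # Set to the minimum resources the application needs to run normally
    cpu: 200m      # 0.2 cores
    memory: 256Mi  # 256 MiB
  limits:
    # Set to the ceiling for peak load (2-4x requests is a reasonable starting point)
    cpu: 500m      # 0.5 cores
    memory: 512Mi  # 512 MiB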

QoS Classes

# Guaranteed (highest priority, evicted last)
# requests == limits for both CPU and memory in every container
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 512Mi

# Burstable (medium priority)
# requests are set, and limits > requests
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

# BestEffort (lowest priority, evicted first)
# no requests or limits set
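Kubernetes records the class it assigned in each Pod's status, so the rules above can be checked against running workloads. A minimal sketch follows; list_qos_classes is a hypothetical helper name, while .status.qosClass is the field Kubernetes populates.

```shell
# Sketch: list each Pod in a namespace together with its assigned QoS class.
# list_qos_classes is a hypothetical helper wrapping a standard kubectl query.
list_qos_classes() {
  local namespace="$1"
  kubectl get pods -n "${namespace}" \
    -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'
}
```

For example: list_qos_classes ecommerce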

Data Backup

MongoDB Backup CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mongodb-backup
  namespace: ecommerce
spec:
  schedule: "0 2 * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: mongo:6.0
            command:
            - /bin/sh
            - -c
            - |
              set -e  # abort on failure so a failed dump never triggers cleanup
              DATE=$(date +%Y%m%d-%H%M%S)
              mongodump \
                --host=mongodb-0.mongodb \
                --username=admin \
                --password="$MONGO_PASSWORD" \
                --authenticationDatabase=admin \
                --gzip \
                --archive="/backup/dump-$DATE.gz"

              # Delete backups older than 7 days
              find /backup -name "dump-*.gz" -mtime +7 -delete
            env:
            - name: MONGO_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-secrets
                  key: mongo-password
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: backup-pvc
          restartPolicy: OnFailure
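A backup is only useful if it can be restored. The sketch below shows one way to restore the newest archive produced by the CronJob above. It is an illustration under assumptions: restore_latest_backup is a hypothetical helper, and it is assumed to run in a Pod that mounts backup-pvc at /backup and has MONGO_PASSWORD set, like the backup container.

```shell
# Sketch: restore the most recent dump written by the backup CronJob.
# restore_latest_backup is a hypothetical helper; host and user mirror the CronJob.
restore_latest_backup() {
  local backup_dir="${1:-/backup}"
  local latest="" f

  # Globs expand in sorted order and the names embed YYYYmmdd-HHMMSS,
  # so the last match is the newest archive.
  for f in "${backup_dir}"/dump-*.gz; do
    [ -e "$f" ] || continue
    latest="$f"
  done

  if [ -z "${latest}" ]; then
    echo "no backups found in ${backup_dir}" >&2
    return 1
  fi

  mongorestore \
    --host=mongodb-0.mongodb \
    --username=admin \
    --password="${MONGO_PASSWORD}" \
    --authenticationDatabase=admin \
    --gzip \
    --archive="${latest}"
}
```

By default mongorestore only inserts documents and does not overwrite existing ones. Adding --drop replaces the existing collections, so test that against a non-production instance first.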

Security Hardening

Pod Security Context

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
  seccompProfile:
    type: RuntimeDefault
  
containers:
- name: app
  securityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop:
      - ALL
      add:
      - NET_BIND_SERVICE

Network Policy

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: product-service-policy
  namespace: ecommerce
spec:
  podSelector:
    matchLabels:
      app: product-service
  policyTypes:
  - Ingress
  - Egress
  ingress:
  # Only allow requests from the API Gateway
  - from:
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
  egress:
  # Allow access to MongoDB
  - to:
    - podSelector:
        matchLabels:
          app: mongodb
    ports:
    - protocol: TCP
      port: 27017
  # Allow DNS lookups (UDP, plus TCP for large responses)
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

Troubleshooting Checklist

# 1. Check Pod status
kubectl get pods -n ecommerce
kubectl describe pod <pod-name> -n ecommerce

# 2. Review events
kubectl get events -n ecommerce --sort-by='.lastTimestamp'

# 3. View logs
kubectl logs <pod-name> -n ecommerce --tail=100

# 4. Check resource usage
kubectl top pods -n ecommerce
kubectl top nodes

# 5. Check network connectivity (run wget inside the temporary Pod's shell)
kubectl run test -it --rm --image=busybox -n ecommerce -- sh
wget -O- http://product-service:8080/health

# 6. Check configuration
kubectl get configmap -n ecommerce
kubectl get secret -n ecommerce

# 7. Check RBAC permissions
kubectl auth can-i --list -n ecommerce

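Summary

This section covered microservice operations:

Update management: rolling updates, rollbacks, pausing and resuming rollouts
Scaling: manual scaling, HPA autoscaling
Log management: viewing logs with kubectl, aggregation with stern
Monitoring and alerting: ServiceMonitor, PrometheusRule
Resource optimization: requests and limits, QoS classes
Security hardening: Security Context, Network Policy
Backup: scheduled MongoDB backups

This concludes the complete microservices deployment tutorial! 🎉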