监控和日志系统

本章节详细介绍如何在 EKS 集群中部署完整的监控和日志系统,包括 Prometheus、Grafana、Alertmanager、ELK Stack 和 Jaeger 分布式追踪。

监控架构设计

监控层次

┌─────────────────────────────────────────────────────┐
│              业务指标监控层                          │
│  ├─ 订单量、转化率、用户活跃度                      │
│  ├─ API 调用成功率、响应时间                        │
│  └─ 业务异常和错误率                                │
└────────────────────┬────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────┐
│              应用指标监控层                          │
│  ├─ JVM/Go/Node.js 运行时指标                       │
│  ├─ 应用日志和异常                                  │
│  ├─ 自定义业务指标                                  │
│  └─ 分布式追踪(Jaeger)                           │
└────────────────────┬────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────┐
│           容器和 Kubernetes 监控层                   │
│  ├─ Pod CPU/内存/网络/磁盘                          │
│  ├─ Container 资源使用                              │
│  ├─ Kubernetes 事件                                 │
│  └─ HPA/VPA 扩展指标                                │
└────────────────────┬────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────┐
│              基础设施监控层                          │
│  ├─ EC2 实例指标(Node)                            │
│  ├─ RDS/Redis/DynamoDB 指标                         │
│  ├─ ALB/NLB 指标                                    │
│  └─ VPC/NAT Gateway 指标                            │
└─────────────────────────────────────────────────────┘
                     ↓
              Prometheus + Grafana
              Alertmanager + PagerDuty
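
上面的应用指标层依赖应用自身暴露 /metrics 端点,并由 Prometheus 通过 Pod 注解自动发现(对应后文 additionalScrapeConfigs 中的抓取配置)。下面是一个示意性的 Deployment 片段,假设应用在 8080 端口的 /metrics 路径暴露指标(应用名、端口与路径均为示例):

# demo-app-deployment.yaml(示意)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
      annotations:
        prometheus.io/scrape: "true"   # 与后文 additionalScrapeConfigs 的 relabel 规则对应
        prometheus.io/path: "/metrics" # 指标路径(示例)
        prometheus.io/port: "8080"     # 指标端口(示例)
    spec:
      containers:
      - name: demo-app
        image: demo-app:latest         # 镜像仅为占位示例
        ports:
        - containerPort: 8080
          name: metrics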

组件架构

┌─────────────────────────────────────────────────────┐
│                   EKS Cluster                        │
│                                                      │
│  ┌────────────────────────────────────────────┐    │
│  │          Prometheus Operator                │    │
│  │  ┌──────────────────────────────────────┐  │    │
│  │  │  Prometheus Server (HA)              │  │    │
│  │  │  ├─ ServiceMonitors                   │  │    │
│  │  │  ├─ PodMonitors                       │  │    │
│  │  │  └─ PrometheusRules                   │  │    │
│  │  └──────────────────────────────────────┘  │    │
│  │                                              │    │
│  │  ┌──────────────────────────────────────┐  │    │
│  │  │  Alertmanager (HA)                   │  │    │
│  │  │  ├─ Slack Notifications              │  │    │
│  │  │  ├─ PagerDuty Integration            │  │    │
│  │  │  └─ Email Alerts                     │  │    │
│  │  └──────────────────────────────────────┘  │    │
│  │                                              │    │
│  │  ┌──────────────────────────────────────┐  │    │
│  │  │  Grafana                              │  │    │
│  │  │  ├─ Dashboards                        │  │    │
│  │  │  ├─ Data Sources                      │  │    │
│  │  │  └─ Alert Rules                       │  │    │
│  │  └──────────────────────────────────────┘  │    │
│  └────────────────────────────────────────────┘    │
│                                                      │
│  ┌────────────────────────────────────────────┐    │
│  │          ELK Stack                          │    │
│  │  ┌──────────────────────────────────────┐  │    │
│  │  │  Elasticsearch (3 nodes)             │  │    │
│  │  └──────────────────────────────────────┘  │    │
│  │  ┌──────────────────────────────────────┐  │    │
│  │  │  Kibana                               │  │    │
│  │  └──────────────────────────────────────┘  │    │
│  │  ┌──────────────────────────────────────┐  │    │
│  │  │  Filebeat (DaemonSet)                │  │    │
│  │  └──────────────────────────────────────┘  │    │
│  └────────────────────────────────────────────┘    │
│                                                      │
│  ┌────────────────────────────────────────────┐    │
│  │       Jaeger (Distributed Tracing)          │    │
│  │  ├─ Jaeger Collector                        │    │
│  │  ├─ Jaeger Query                            │    │
│  │  └─ Jaeger Agent (DaemonSet)                │    │
│  └────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────┘

Prometheus 部署

安装 Prometheus Operator

#!/bin/bash
# install-prometheus-operator.sh

NAMESPACE="monitoring"

echo "================================================"
echo "安装 Prometheus Operator"
echo "================================================"

# 1. 创建 namespace
echo ""
echo "1. 创建 monitoring namespace..."
kubectl create namespace $NAMESPACE

# 2. 添加 Helm 仓库
echo ""
echo "2. 添加 Prometheus Helm 仓库..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# 3. 创建自定义 values 文件
echo ""
echo "3. 创建配置文件..."
cat > prometheus-values.yaml << 'EOF'
# Prometheus Operator 配置
prometheus:
  prometheusSpec:
    # 资源配置
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi
    
    # 存储配置
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    
    # 数据保留
    retention: 30d
    retentionSize: "45GB"
    
    # 高可用配置
    replicas: 2
    
    # Pod 反亲和性
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - prometheus
            topologyKey: topology.kubernetes.io/zone
    
    # Service Monitor / Pod Monitor / Rule 选择器
    # chart 默认只匹配带本 Helm release 标签的对象;
    # 置为 false 后可匹配集群内所有 ServiceMonitor/PodMonitor/PrometheusRule
    serviceMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    
    podMonitorSelectorNilUsesHelmValues: false
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}
    
    ruleSelectorNilUsesHelmValues: false
    
    # 额外的抓取配置
    additionalScrapeConfigs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

# Alertmanager 配置
alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
    
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    
    replicas: 3
    
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                - alertmanager
            topologyKey: topology.kubernetes.io/zone

# Grafana 配置
grafana:
  enabled: true
  
  adminPassword: "ChangeMe123!"
  
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  
  # 数据源
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-operated:9090
        access: proxy
        isDefault: true
  
  # 仪表盘提供者
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
      - name: 'default'
        orgId: 1
        folder: ''
        type: file
        disableDeletion: false
        editable: true
        options:
          path: /var/lib/grafana/dashboards/default
  
  # 预装仪表盘
  dashboards:
    default:
      kubernetes-cluster:
        gnetId: 7249
        revision: 1
        datasource: Prometheus
      kubernetes-pods:
        gnetId: 6417
        revision: 1
        datasource: Prometheus
      node-exporter:
        gnetId: 1860
        revision: 27
        datasource: Prometheus

# Prometheus Node Exporter
prometheus-node-exporter:
  enabled: true

# Kube State Metrics
kube-state-metrics:
  enabled: true
EOF

# 4. 安装 Prometheus Stack
echo ""
echo "4. 安装 Prometheus Stack..."
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  -n $NAMESPACE \
  -f prometheus-values.yaml

echo ""
echo "5. 等待 Pods 就绪..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n $NAMESPACE \
  --timeout=300s

echo ""
echo "================================================"
echo "Prometheus Operator 安装完成!"
echo "================================================"
echo ""
echo "访问方式:"
echo "  Prometheus: kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090"
echo "  Grafana: kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80"
echo "  Alertmanager: kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-alertmanager 9093:9093"
echo ""
echo "Grafana 默认凭证:"
echo "  用户名: admin"
echo "  密码: ChangeMe123!"
echo "================================================"

配置 Alertmanager

# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-stack-kube-prom-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 5m   # 同组出现新告警时的再次通知间隔,过短容易造成通知风暴
      repeat_interval: 12h
      receiver: 'default'
      routes:
      # 严重告警
      - match:
          severity: critical
        receiver: 'pagerduty-critical'
        continue: true
      - match:
          severity: critical
        receiver: 'slack-critical'
      
      # 警告告警
      - match:
          severity: warning
        receiver: 'slack-warning'
      
      # 信息告警
      - match:
          severity: info
        receiver: 'slack-info'
    
    receivers:
    # 默认接收器
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    # 严重告警 - PagerDuty
    - name: 'pagerduty-critical'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}'
    
    # 严重告警 - Slack
    - name: 'slack-critical'
      slack_configs:
      - channel: '#alerts-critical'
        color: 'danger'
        title: '🚨 Critical Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Description:* {{ .Annotations.description }}
          *Details:* {{ .Annotations.summary }}
          {{ end }}
    
    # 警告告警 - Slack
    - name: 'slack-warning'
      slack_configs:
      - channel: '#alerts-warning'
        color: 'warning'
        title: '⚠️ Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    # 信息告警 - Slack
    - name: 'slack-info'
      slack_configs:
      - channel: '#alerts-info'
        color: 'good'
        title: 'ℹ️ Info: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
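
后文最佳实践中提到的"抑制规则"可以直接追加到上面的 alertmanager.yaml 中(与 route、receivers 平级)。下面的示意片段使用 Alertmanager 0.22+ 的 matchers 语法:当同一告警已触发 critical 级别时,抑制对应的 warning 通知:

# 追加到 alertmanager.yaml 的抑制规则示例
inhibit_rules:
- source_matchers:
  - severity="critical"
  target_matchers:
  - severity="warning"
  # 仅在这些标签取值完全一致时才抑制,避免误伤无关告警
  equal: ['alertname', 'namespace']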

创建告警规则

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
  labels:
    prometheus: kube-prometheus
spec:
  groups:
  # 应用健康告警
  - name: application-health
    interval: 30s
    rules:
    # Pod 重启频繁
    - alert: PodRestartingFrequently
      expr: |
        increase(kube_pod_container_status_restarts_total[15m]) > 3
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 重启频繁"
        description: "Pod 在过去 15 分钟内重启了 {{ $value }} 次"
    
    # Pod 未就绪
    - alert: PodNotReady
      expr: |
        sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 状态异常"
        description: "Pod 处于 Pending/Unknown/Failed 状态已超过 10 分钟(已排除正常完成的 Job Pod)"
    
    # 容器 OOMKilled
    - alert: ContainerOOMKilled
      expr: |
        sum by (namespace, pod, container) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "容器 {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} 因 OOM 被杀死"
        description: "需要增加内存限制或优化应用内存使用"
  
  # 资源使用告警
  - name: resource-usage
    interval: 30s
    rules:
    # CPU 使用率高
    - alert: HighCPUUsage
      expr: |
        sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])) /
        sum by (namespace, pod) ((container_spec_cpu_quota{container!=""} > 0) / container_spec_cpu_period{container!=""}) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU 使用率超过 90%"
        description: "当前 CPU 使用率: {{ $value | humanizePercentage }}"
    
    # 内存使用率高
    - alert: HighMemoryUsage
      expr: |
        sum by (namespace, pod) (container_memory_working_set_bytes{container!=""}) /
        sum by (namespace, pod) (container_spec_memory_limit_bytes{container!=""} > 0) > 0.9
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} 内存使用率超过 90%"
        description: "当前内存使用率: {{ $value | humanizePercentage }}"
    
    # 磁盘使用率高
    - alert: HighDiskUsage
      expr: |
        (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) /
        node_filesystem_size_bytes{mountpoint="/"} > 0.85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "节点 {{ $labels.instance }} 磁盘使用率超过 85%"
        description: "当前磁盘使用率: {{ $value | humanizePercentage }}"
  
  # API 性能告警
  - name: api-performance
    interval: 30s
    rules:
    # API 响应时间慢
    - alert: SlowAPIResponse
      expr: |
        histogram_quantile(0.95,
          sum by (le, namespace, service) (rate(http_request_duration_seconds_bucket[5m]))
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "服务 {{ $labels.namespace }}/{{ $labels.service }} API 响应慢"
        description: "P95 响应时间: {{ $value }}s"
    
    # API 错误率高
    - alert: HighAPIErrorRate
      expr: |
        sum by (namespace, service) (rate(http_requests_total{code=~"5.."}[5m])) /
        sum by (namespace, service) (rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "服务 {{ $labels.namespace }}/{{ $labels.service }} 错误率高"
        description: "当前错误率: {{ $value | humanizePercentage }}"
  
  # 数据库告警
  - name: database-alerts
    interval: 30s
    rules:
    # RDS CPU 高
    - alert: RDSHighCPU
      expr: |
        aws_rds_cpuutilization_average > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "RDS {{ $labels.dbinstance_identifier }} CPU 使用率高"
        description: "当前 CPU: {{ $value }}%"
    
    # RDS 连接数高
    - alert: RDSHighConnections
      expr: |
        aws_rds_database_connections_average > 800
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RDS {{ $labels.dbinstance_identifier }} 连接数过高"
        description: "当前连接数: {{ $value }}"
    
    # Redis 内存使用率高
    - alert: RedisHighMemory
      expr: |
        redis_memory_used_bytes / redis_memory_max_bytes > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Redis {{ $labels.instance }} 内存使用率高"
        description: "当前使用率: {{ $value | humanizePercentage }}"

ServiceMonitor 示例

# user-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: user-service
  namespace: production
  labels:
    app: user-service
spec:
  selector:
    matchLabels:
      app: user-service
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    scheme: http
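
上面的 ServiceMonitor 要求 production 命名空间中存在一个带 app: user-service 标签、且端口命名为 metrics 的 Service。对于没有 Service 的工作负载(例如批处理任务),也可以使用 values 中已放开选择器的 PodMonitor 直接按 Pod 标签抓取。下面是一个示意(名称、标签与端口名均为假设):

# batch-worker-podmonitor.yaml(示意)
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-worker
  namespace: production
  labels:
    app: batch-worker
spec:
  selector:
    matchLabels:
      app: batch-worker
  podMetricsEndpoints:
  - port: metrics      # 对应容器中命名为 metrics 的 containerPort
    interval: 30s
    path: /metrics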

Grafana 仪表盘

自定义业务仪表盘

{
  "dashboard": {
    "title": "Business Metrics Dashboard",
    "tags": ["business", "production"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "订单量(实时)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(orders_total[5m]))",
            "legendFormat": "每秒订单数"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "id": 2,
        "title": "API 成功率",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{code!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 4}
      },
      {
        "id": 3,
        "title": "P95 响应时间",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ],
        "gridPos": {"x": 18, "y": 0, "w": 6, "h": 4}
      }
    ]
  }
}
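
除了在 values 中通过 gnetId 预装社区仪表盘,自定义仪表盘也可以打包成 ConfigMap,由 kube-prometheus-stack 默认启用的 Grafana sidecar 按标签自动导入(默认标签为 grafana_dashboard,具体取值以所装 chart 版本的 values 为准)。示意如下:

# business-dashboard-configmap.yaml(示意)
apiVersion: v1
kind: ConfigMap
metadata:
  name: business-metrics-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # Grafana sidecar 依据该标签发现并导入仪表盘
data:
  # 将上文完整的 dashboard JSON 作为文件内容放入(此处为截断示意)
  business-metrics.json: |
    { "title": "Business Metrics Dashboard", "panels": [] }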

ELK Stack 部署

Elasticsearch 部署

# elasticsearch-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: monitoring
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
      - name: increase-vm-max-map
        image: busybox
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
        env:
        - name: cluster.name
          value: "production-logs"
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: discovery.seed_hosts
          value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: ES_JAVA_OPTS
          value: "-Xms2g -Xmx2g"
        - name: xpack.security.enabled
          value: "false"
        
        ports:
        - containerPort: 9200
          name: http
        - containerPort: 9300
          name: transport
        
        resources:
          requests:
            cpu: 1000m
            memory: 3Gi
          limits:
            cpu: 2000m
            memory: 4Gi
        
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: gp3
      resources:
        requests:
          storage: 100Gi

---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: monitoring
spec:
  selector:
    app: elasticsearch
  ports:
  - port: 9200
    name: http
  - port: 9300
    name: transport
  clusterIP: None

Kibana 部署

# kibana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:8.11.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "http://elasticsearch:9200"
        - name: SERVER_NAME
          value: "kibana"
        - name: SERVER_HOST
          value: "0.0.0.0"
        
        ports:
        - containerPort: 5601
          name: http
        
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi
        
        readinessProbe:
          httpGet:
            path: /api/status
            port: 5601
          initialDelaySeconds: 30
          periodSeconds: 10

---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: monitoring
spec:
  selector:
    app: kibana
  ports:
  - port: 5601
    targetPort: 5601
  type: ClusterIP

Filebeat 部署

# filebeat-daemonset.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: monitoring
data:
  filebeat.yml: |
    filebeat.inputs:
    - type: container
      paths:
        - /var/log/containers/*.log
      processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
          - logs_path:
              logs_path: "/var/log/containers/"
    
    output.elasticsearch:
      hosts: ['elasticsearch:9200']
      index: "filebeat-%{+yyyy.MM.dd}"
    
    setup.template.name: "filebeat"
    setup.template.pattern: "filebeat-*"
    setup.ilm.enabled: false

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:8.11.0
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 200m
            memory: 400Mi
        
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          subPath: filebeat.yml
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
          readOnly: true
      
      volumes:
      - name: config
        configMap:
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: varlog
        hostPath:
          path: /var/log
      - name: data
        hostPath:
          path: /var/lib/filebeat-data
          type: DirectoryOrCreate
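
上面的 DaemonSet 引用了 serviceAccountName: filebeat,且 add_kubernetes_metadata 处理器需要读取 Pod、Namespace、Node 的元数据,因此还需要创建对应的 ServiceAccount 与 RBAC。下面是一个最小化的示意配置(权限范围可按需收紧):

# filebeat-rbac.yaml(示意)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
rules:
- apiGroups: [""]
  resources: ["namespaces", "pods", "nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: monitoring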

Jaeger 分布式追踪

安装 Jaeger Operator

#!/bin/bash
# install-jaeger.sh

echo "安装 Jaeger Operator..."

# 安装 cert-manager(前置依赖),并等待其就绪,否则 Operator 的 webhook 证书会签发失败
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s

# 安装 Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.50.0/jaeger-operator.yaml -n observability

echo "✓ Jaeger Operator 已安装"

Jaeger 实例配置

# jaeger-instance.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
  namespace: monitoring
spec:
  strategy: production
  
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
        index-prefix: jaeger
    esIndexCleaner:
      enabled: true
      numberOfDays: 7
      schedule: "55 23 * * *"
  
  collector:
    replicas: 3
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
  
  query:
    replicas: 2
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
  
  agent:
    strategy: DaemonSet
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
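
agent 以 DaemonSet 方式运行时,应用需要把 span 发送到所在节点上的 agent。常见做法是通过 Downward API 把节点 IP 注入为 JAEGER_AGENT_HOST,并用环境变量控制采样(下面的变量名是 Jaeger 客户端库的通用约定,具体以所用 SDK 为准,仅作示意):

# 应用 Deployment 中容器的环境变量片段(示意)
env:
- name: JAEGER_AGENT_HOST
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP   # DaemonSet agent 通过 hostPort 监听在节点 IP 上
- name: JAEGER_AGENT_PORT
  value: "6831"                  # jaeger.thrift compact(UDP)默认端口
- name: JAEGER_SAMPLER_TYPE
  value: "probabilistic"
- name: JAEGER_SAMPLER_PARAM
  value: "0.1"                   # 采样 10% 的请求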

最佳实践总结

1. Prometheus 配置

✓ 使用 Operator 管理
✓ 配置高可用(2+ 副本)
✓ 合理设置数据保留期
✓ 使用持久化存储
✓ 配置 ServiceMonitor 自动发现
✓ 分层设置告警规则

2. Grafana 配置

✓ 使用持久化存储
✓ 配置多数据源
✓ 创建分层仪表盘
✓ 启用告警通知
✓ 定期备份配置
✓ 使用变量提高复用性

3. Alertmanager 配置

✓ 配置多接收器
✓ 按严重性分组
✓ 避免告警风暴
✓ 配置抑制规则
✓ 集成 PagerDuty
✓ 定期测试告警

4. 日志管理

✓ 集中式日志收集
✓ 结构化日志格式
✓ 配置索引生命周期(见下方示例)
✓ 设置日志保留策略
✓ 使用日志级别过滤
✓ 定期清理旧日志
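
针对上面"配置索引生命周期"一项:前文 Filebeat 配置为简化示例关闭了 ILM(setup.ilm.enabled: false)。若希望由 Elasticsearch ILM 自动滚动与清理索引,可以参考如下 filebeat.yml 片段(以 Filebeat 8.x 的 setup.ilm.* 选项为例,策略名为示例,具体键名以所用版本文档为准):

# filebeat.yml 启用 ILM 的片段(示意,替换前文的 setup.ilm.enabled: false)
setup.ilm.enabled: true
setup.ilm.policy_name: "filebeat-logs"   # 引用的 ILM 策略名(示例,需在 Elasticsearch 中存在或由 Filebeat 创建)
setup.ilm.check_exists: true             # 启动时检查策略是否存在
# 注意:启用 ILM 后,前文 output.elasticsearch.index 的自定义索引名会被忽略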

5. 分布式追踪

✓ 在所有服务启用追踪
✓ 使用采样降低开销(见下方示例)
✓ 配置上下文传播
✓ 设置合理的保留期
✓ 监控追踪系统性能
✓ 与日志和指标关联
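
关于"使用采样降低开销":除了前面应用侧的 JAEGER_SAMPLER_* 环境变量,也可以在 Jaeger 实例上集中下发默认采样策略(jaeger-operator 支持 spec.sampling.options),示意如下:

# 在 jaeger-instance.yaml 的 spec 下追加默认采样策略(示意)
sampling:
  options:
    default_strategy:
      type: probabilistic
      param: 0.1          # 默认采样 10%,高流量服务可进一步调低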

下一步: 继续学习 自动扩展和容灾 章节。