Monitoring and Logging

The Three Pillars of Observability

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Metrics    │  │     Logs     │  │    Traces    │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └─────────────────┴─────────────────┘
                         │
              ┌──────────▼──────────┐
              │    Observability    │
              │       Platform      │
              └─────────────────────┘

Core Goals

  • Metrics: understand the state of the system
  • Logs: understand what happened
  • Traces: understand how a request flowed through the system

Prometheus - Metrics Monitoring

Architecture Overview

┌─────────────┐     Pull Metrics     ┌──────────────┐
│ Application │◄────────────────────│  Prometheus  │
│  /metrics   │                     │    Server    │
└─────────────┘                     └──────┬───────┘
                                           │
                    ┌──────────────────────┼────────────┐
                    │                      │            │
             ┌──────▼──────┐        ┌─────▼──────┐  ┌───▼──────┐
             │   Grafana   │        │Alertmanager│  │  Time    │
             │(dashboards) │        │  (alerts)  │  │ Series DB│
             └─────────────┘        └────────────┘  └──────────┘

Installing the Prometheus Stack

# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (includes Prometheus, Grafana, and Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123

# Check deployment status
kubectl get pods -n monitoring

Accessing Prometheus and Grafana

# Port-forward Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090

# Port-forward Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open: http://localhost:3000 (admin/admin123)

Exposing Application Metrics

Node.js example

// app.js
const express = require('express');
const promClient = require('prom-client');

const app = express();

// Create a registry and collect the default Node.js metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});
register.registerMetric(httpRequestDuration);

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestTotal);

// Middleware that records request metrics
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration.labels(req.method, req.route?.path || req.path, res.statusCode).observe(duration);
    httpRequestTotal.labels(req.method, req.route?.path || req.path, res.statusCode).inc();
  });
  next();
});

// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
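
With the app running, the endpoint returns metrics in the Prometheus text exposition format. An abridged, illustrative example of what the output might look like (values are made up):

# curl http://localhost:3000/metrics  (abridged, illustrative output)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/",status_code="200"} 42
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1",method="GET",route="/",status_code="200"} 40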

ServiceMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: production
  labels:
    release: prometheus  # must match the serviceMonitorSelector of the Prometheus instance
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
    relabelings:
    - sourceLabels: [__meta_kubernetes_pod_name]
      targetLabel: pod
    - sourceLabels: [__meta_kubernetes_namespace]
      targetLabel: namespace
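
A ServiceMonitor selects a Service (not Pods) by label, so a matching Service must exist and its port name must line up with the endpoint entry above. A minimal sketch for the Node.js app; names and ports are assumptions consistent with the examples in this chapter:

apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
  labels:
    app: myapp          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: myapp          # assumes the Pods carry this label
  ports:
  - name: http          # must match the endpoint port name above
    port: 80
    targetPort: 3000    # the Node.js app above listens on 3000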

PodMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: myapp-pod-monitor
  namespace: production
spec:
  selector:
    matchLabels:
      app: myapp
  podMetricsEndpoints:
  - port: metrics
    interval: 30s
    path: /metrics

PrometheusRule - Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: monitoring
  labels:
    release: prometheus  # must match the ruleSelector of the Prometheus instance
spec:
  groups:
  - name: myapp.rules
    interval: 30s
    rules:
    # Pod unavailable alert
    - alert: PodDown
      expr: up{job="myapp"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is down"
        description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been down for more than 5 minutes."
    
    # High error rate alert (rates are aggregated so 5xx traffic is compared against total traffic)
    - alert: HighErrorRate
      expr: |
        sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
        /
        sum by (job) (rate(http_requests_total[5m])) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
    
    # High latency alert
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95, 
          rate(http_request_duration_seconds_bucket[5m])
        ) > 2
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High latency detected"
        description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
    
    # High memory usage alert
    - alert: HighMemoryUsage
      expr: |
        (container_memory_usage_bytes{pod=~"myapp-.*"} 
        / 
        container_spec_memory_limit_bytes{pod=~"myapp-.*"}) > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage"
        description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
    
    # High CPU usage alert
    - alert: HighCPUUsage
      expr: |
        rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m]) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage"
        description: "Pod {{ $labels.pod }} CPU usage is {{ $value | humanizePercentage }}"

Alertmanager Configuration

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
      - match:
          severity: critical
        receiver: 'critical'
        continue: true
      - match:
          severity: warning
        receiver: 'warning'
    
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: 'Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'critical'
      slack_configs:
      - channel: '#alerts-critical'
        title: '🔥 Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
    
    - name: 'warning'
      slack_configs:
      - channel: '#alerts-warning'
        title: '⚠️ Warning Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
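
The routing tree and receivers can be validated locally with amtool (shipped with Alertmanager) before the Secret is applied. A sketch with illustrative file names; how the operator picks up this Secret depends on how the stack was installed — with kube-prometheus-stack the configuration is often supplied through the alertmanager.config Helm value instead:

# Validate the raw Alertmanager configuration
amtool check-config alertmanager.yaml

# Apply the Secret
kubectl apply -f alertmanager-config.yaml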

Grafana - Data Visualization

Importing Prebuilt Dashboards

# Log in to Grafana
# Go to Dashboards → Import
# Enter one of the following dashboard IDs:

# Kubernetes cluster monitoring
15760  # Kubernetes / Views / Global
15759  # Kubernetes / Views / Namespaces
15758  # Kubernetes / Views / Pods

# Node Exporter
1860   # Node Exporter Full

# Application metrics
6417   # Kubernetes Deployment Statefulset Daemonset metrics

Custom Dashboard JSON

{
  "dashboard": {
    "title": "MyApp Metrics",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{job=\"myapp\"}[5m])",
            "legendFormat": "{{pod}}"
          }
        ]
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{job=\"myapp\",status_code=~\"5..\"}[5m])",
            "legendFormat": "{{pod}} - {{status_code}}"
          }
        ]
      },
      {
        "id": 3,
        "title": "Latency (P95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m]))",
            "legendFormat": "{{pod}}"
          }
        ]
      }
    ]
  }
}

Deploying Dashboards via ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  myapp-dashboard.json: |
    {
      "dashboard": {
        "title": "MyApp Dashboard",
        "panels": [...]
      }
    }

Log Collection - The EFK Stack

Architecture

┌─────────────┐
│ Application │
│   (stdout)  │
└──────┬──────┘
       │
┌──────▼──────┐
│  Fluentd    │ ◄── collects logs
│ DaemonSet   │
└──────┬──────┘
       │
┌──────▼──────────┐
│ Elasticsearch   │ ◄── stores logs
│    Cluster      │
└──────┬──────────┘
       │
┌──────▼──────┐
│   Kibana    │ ◄── search & display
└─────────────┘

Installing Elasticsearch

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  clusterIP: None
  ports:
  - port: 9200
    name: http
  - port: 9300
    name: transport
  selector:
    app: elasticsearch
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
      - name: increase-vm-max-map
        image: busybox
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
        env:
        - name: cluster.name
          value: "k8s-logs"
        - name: node.name
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: discovery.seed_hosts
          value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        - name: ES_JAVA_OPTS
          value: "-Xms2g -Xmx2g"
        - name: xpack.security.enabled
          value: "false"
        ports:
        - containerPort: 9200
          name: http
        - containerPort: 9300
          name: transport
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
        resources:
          requests:
            memory: 2Gi
            cpu: 1
          limits:
            memory: 4Gi
            cpu: 2
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
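
Once the three Pods are Running, cluster formation can be verified with the health API. A quick check, assuming security stays disabled as configured above:

# Port-forward one Elasticsearch pod
kubectl port-forward -n logging elasticsearch-0 9200:9200

# In another terminal: expect "status": "green" and "number_of_nodes": 3
curl http://localhost:9200/_cluster/health?pretty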

Installing Fluentd

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
      logstash_prefix k8s
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_interval 5s
        retry_forever false
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluentd/etc/fluent.conf
          subPath: fluent.conf
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 500m
            memory: 500Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluentd-config
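
The DaemonSet runs as serviceAccountName: fluentd, and the kubernetes_metadata filter needs read access to Pod and Namespace objects. The account is not part of the manifests above, so a minimal RBAC sketch to go with them:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: logging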

Installing Kibana

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - name: kibana
        image: docker.elastic.co/kibana/kibana:8.11.0
        env:
        - name: ELASTICSEARCH_HOSTS
          value: "http://elasticsearch:9200"
        ports:
        - containerPort: 5601
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1
            memory: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  type: NodePort
  ports:
  - port: 5601
    targetPort: 5601
  selector:
    app: kibana

Loki - A Lightweight Logging Option

Installing the Loki Stack

# Install Loki + Promtail + Grafana with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace \
  --set grafana.enabled=true \
  --set prometheus.enabled=true \
  --set promtail.enabled=true
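
Once Promtail is shipping logs, they are queried from Grafana's Explore view with LogQL. Two illustrative queries; the label names assume the relabeling shown in the next section:

# All logs from the myapp Pods in the production namespace
{namespace="production", app="myapp"}

# Per-second rate of lines containing "error" over the last minute
sum(rate({namespace="production", app="myapp"} |= "error" [1m]))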

Promtail Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki:3100/loki/api/v1/push

    scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container

Distributed Tracing - Jaeger

Installing Jaeger

kubectl create namespace tracing
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/crds/jaegertracing.io_jaegers_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/service_account.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role_binding.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/operator.yaml

Jaeger Instance

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch.logging:9200
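
With the production strategy the operator deploys separate collector and query components; the UI sits behind the query Service, which the operator names <instance>-query (so jaeger-query for the instance above). A sketch for local access:

kubectl port-forward -n tracing svc/jaeger-query 16686:16686
# Open: http://localhost:16686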

Application Integration with OpenTelemetry

// Node.js example
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider();

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger-collector:14268/api/traces',
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});
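
Tracing has to be initialised before the instrumented modules (http, express) are loaded, so a common pattern is to keep the snippet above in its own file and preload it. The file name below is illustrative:

# Preload the tracer, then start the app
node -r ./tracing.js app.js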

Common PromQL Queries

# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])

# Memory usage as a fraction of the limit
container_memory_usage_bytes{pod=~"myapp-.*"} / container_spec_memory_limit_bytes{pod=~"myapp-.*"}

# HTTP request rate (QPS)
rate(http_requests_total[5m])

# HTTP error rate
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# HTTP latency, P95
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Pod restarts over the last hour
increase(kube_pod_container_status_restarts_total[1h])

# Available Pods per Deployment
kube_deployment_status_replicas_available

# Disk usage ratio
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes

Best Practices

1. Logging conventions

// An example structured log entry
{
  "timestamp": "2024-01-08T12:00:00Z",
  "level": "error",
  "message": "Failed to process request",
  "service": "myapp",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user123",
  "request_id": "req789",
  "error": {
    "type": "DatabaseError",
    "message": "Connection timeout"
  }
}
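
A sketch of emitting entries in this shape from Node.js, assuming the pino library (any JSON logger works similarly; the field names mirror the example above):

// logger.js — minimal sketch using pino (assumed dependency)
const pino = require('pino');

const logger = pino({
  base: { service: 'myapp' },                           // added to every entry
  timestamp: pino.stdTimeFunctions.isoTime,             // ISO timestamps
  formatters: { level: (label) => ({ level: label }) }  // "error" instead of 50
});

// Fields become JSON keys, so they stay queryable in Elasticsearch or Loki
logger.error(
  {
    trace_id: 'abc123',
    request_id: 'req789',
    error: { type: 'DatabaseError', message: 'Connection timeout' },
  },
  'Failed to process request'
);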

2. Metric naming

# Follow the Prometheus naming conventions
http_requests_total          # Counter
http_request_duration_seconds # Histogram
memory_usage_bytes           # Gauge
queue_size                   # Gauge

3. Alert design

  • Symptom alerts: alert on user impact (latency, error rate)
  • Cause alerts: alert on root causes (CPU, memory)
  • Severity tiers: Critical, Warning, Info
  • Noise reduction: choose sensible for durations and thresholds

4. Dashboard design

  • RED method: Rate, Errors, Duration
  • USE method: Utilization, Saturation, Errors
  • Layered design: overview → service → instance

5. Data retention

# Keep 30 days of metrics (kube-prometheus-stack Helm values)
prometheus:
  prometheusSpec:
    retention: 30d

# Elasticsearch index lifecycle policy
PUT _ilm/policy/k8s-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
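
The policy only takes effect once it is attached to the log indices. A sketch of an index template that applies it to the k8s-* indices written by Fluentd (names are illustrative; if the rollover action is used, index.lifecycle.rollover_alias must also be set):

PUT _index_template/k8s-logs-template
{
  "index_patterns": ["k8s-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "k8s-logs-policy"
    }
  }
}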

Common Commands

# Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Kibana
kubectl port-forward -n logging svc/kibana 5601:5601

# View logs
kubectl logs -f <pod-name>
kubectl logs <pod-name> --previous  # logs from the previous container instance
kubectl logs <pod-name> -c <container-name>  # multi-container Pods

# Aggregate logs across Pods
stern <pod-name> -n <namespace>
kubectl logs -l app=myapp --tail=100

Summary

Observability is the foundation of system reliability:

Monitoring

  • Prometheus: time-series metrics storage
  • Grafana: visualization
  • Alertmanager: alert routing and notification

Logging

  • EFK: Elasticsearch + Fluentd + Kibana (full-featured)
  • Loki: lightweight and lower-cost

Tracing

  • Jaeger: distributed tracing
  • OpenTelemetry: the unified instrumentation standard

Core principles

  • Structured logs
  • Meaningful metrics
  • Well-tuned alerts
  • Clear dashboards

In the next chapter we turn to troubleshooting: locating and resolving problems quickly.