Grafana Visualization

Grafana is an open-source visualization and analytics platform, commonly used to display monitoring data collected by Prometheus.

Accessing Grafana

# Port-forward the Grafana service
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Open http://localhost:3000
# Default username: admin
# Default password: prom-operator (or the password set at install time)

Importing Prebuilt Dashboards

The Grafana community provides a large number of prebuilt dashboards that can be imported and used directly.

Common Dashboard IDs

# In Grafana: Dashboards → Import → enter the ID

# Kubernetes cluster monitoring
15760  # Kubernetes / Views / Global
15759  # Kubernetes / Views / Namespaces
15758  # Kubernetes / Views / Pods
15757  # Kubernetes / Views / Nodes

# Node Exporter (node-level monitoring)
1860   # Node Exporter Full
11074  # Node Exporter for Prometheus Dashboard

# Kubernetes resource monitoring
6417   # Kubernetes Deployment Statefulset Daemonset metrics
7249   # Kubernetes Cluster
8588   # Kubernetes Pod Resources

# Nginx Ingress
9614   # NGINX Ingress controller
11274  # NGINX Ingress Controller (Community)

Importing a Dashboard Manually

# 1. Visit https://grafana.com/grafana/dashboards/
# 2. Search for the dashboard you need
# 3. Copy its JSON or ID
# 4. Import it in Grafana

Creating a Custom Dashboard

Dashboard Configuration

{
  "dashboard": {
    "title": "MyApp Monitoring",
    "tags": ["kubernetes", "myapp"],
    "timezone": "browser",
    "schemaVersion": 27,
    "panels": [
      {
        "id": 1,
        "title": "Request Rate (QPS)",
        "type": "graph",
        "gridPos": {
          "x": 0,
          "y": 0,
          "w": 12,
          "h": 8
        },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"myapp\"}[5m])) by (pod)",
            "legendFormat": "{{pod}}",
            "refId": "A"
          }
        ],
        "yaxes": [
          {
            "label": "requests/sec",
            "format": "short"
          }
        ]
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "gridPos": {
          "x": 12,
          "y": 0,
          "w": 12,
          "h": 8
        },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"myapp\",status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"myapp\"}[5m]))",
            "legendFormat": "Error Rate",
            "refId": "A"
          }
        ],
        "yaxes": [
          {
            "label": "error rate",
            "format": "percentunit"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.05],
                "type": "gt"
              },
              "operator": {
                "type": "and"
              },
              "query": {
                "params": ["A", "5m", "now"]
              },
              "reducer": {
                "params": [],
                "type": "avg"
              },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "High Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 3,
        "title": "Response Time (P95)",
        "type": "graph",
        "gridPos": {
          "x": 0,
          "y": 8,
          "w": 12,
          "h": 8
        },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m])) by (le, pod))",
            "legendFormat": "{{pod}} - P95",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m])) by (le, pod))",
            "legendFormat": "{{pod}} - P99",
            "refId": "B"
          }
        ],
        "yaxes": [
          {
            "label": "seconds",
            "format": "s"
          }
        ]
      },
      {
        "id": 4,
        "title": "Pod CPU Usage",
        "type": "graph",
        "gridPos": {
          "x": 12,
          "y": 8,
          "w": 12,
          "h": 8
        },
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{pod=~\"myapp-.*\"}[5m])) by (pod)",
            "legendFormat": "{{pod}}",
            "refId": "A"
          }
        ],
        "yaxes": [
          {
            "label": "cores",
            "format": "short"
          }
        ]
      },
      {
        "id": 5,
        "title": "Pod Memory Usage",
        "type": "graph",
        "gridPos": {
          "x": 0,
          "y": 16,
          "w": 12,
          "h": 8
        },
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes{pod=~\"myapp-.*\"}) by (pod)",
            "legendFormat": "{{pod}}",
            "refId": "A"
          }
        ],
        "yaxes": [
          {
            "label": "bytes",
            "format": "bytes"
          }
        ]
      }
    ]
  }
}
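The P95/P99 panels above rely on PromQL's histogram_quantile, which picks the histogram bucket containing the requested quantile and linearly interpolates within it. A minimal Python sketch of that estimation (a simplification of the real PromQL implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound and
    ending with float('inf') -- the shape of *_bucket{le="..."} series.
    Simplified sketch of PromQL's histogram_quantile.
    """
    total = buckets[-1][1]          # the +Inf bucket holds the total count
    rank = q * total                # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound   # quantile lies beyond the last finite bucket
            # Linearly interpolate within the bucket containing the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, buckets))  # → 0.75
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the latencies you care about.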

Deploying Dashboards via ConfigMap

With kube-prometheus-stack, the Grafana sidecar watches for ConfigMaps carrying the grafana_dashboard label and loads the JSON they contain automatically.

apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  myapp-dashboard.json: |
    {
      "dashboard": {
        "title": "MyApp Monitoring",
        "panels": [...]
      }
    }
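Hand-editing dashboard JSON inside a ConfigMap is error-prone, so it is often generated from code instead. A minimal sketch (the helper name and panel layout are illustrative, not part of the original ConfigMap):

```python
import json

def make_dashboard(title, queries, panel_width=12, panel_height=8):
    """Build Grafana dashboard JSON from (panel_title, promql) pairs,
    laying panels out two per row on Grafana's 24-column grid."""
    panels = []
    for i, (panel_title, expr) in enumerate(queries):
        panels.append({
            "id": i + 1,
            "title": panel_title,
            "type": "graph",
            "gridPos": {"x": (i % 2) * panel_width,   # alternate left/right column
                        "y": (i // 2) * panel_height, # new row every two panels
                        "w": panel_width, "h": panel_height},
            "targets": [{"expr": expr, "refId": "A"}],
        })
    return {"dashboard": {"title": title, "panels": panels}}

dash = make_dashboard("MyApp Monitoring", [
    ("Request Rate (QPS)", 'sum(rate(http_requests_total{job="myapp"}[5m])) by (pod)'),
    ("Pod CPU Usage", 'sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)'),
])
print(json.dumps(dash, indent=2))  # paste into the ConfigMap's data field
```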

Alertmanager Integration

Configuring Alert Notifications

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    
    # Routing configuration
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s         # wait before the first notification for a new group
      group_interval: 10s     # wait before notifying about new alerts in an existing group
      repeat_interval: 12h    # wait before re-sending a still-firing alert
      receiver: 'default'
      
      routes:
      # Critical alerts
      - match:
          severity: critical
        receiver: 'critical'
        continue: true
        group_wait: 0s
      
      # Warning alerts
      - match:
          severity: warning
        receiver: 'warning'
        group_wait: 30s
      
      # Alerts for a specific service
      - match:
          service: database
        receiver: 'database-team'
    
    # Receiver configuration
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'critical'
      slack_configs:
      - channel: '#alerts-critical'
        title: '🔥 Critical: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
      email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
    
    - name: 'warning'
      slack_configs:
      - channel: '#alerts-warning'
        title: '⚠️ Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'database-team'
      slack_configs:
      - channel: '#db-alerts'
      email_configs:
      - to: 'db-team@example.com'
    
    # Inhibition rules
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'cluster', 'service']
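The routing tree above is evaluated top to bottom: the first matching child route wins unless it sets continue: true, in which case later siblings are also tried, and an alert matching nothing falls back to the root receiver. A simplified Python model of that behavior (flat route list, exact-match labels only, no nested sub-routes or regex matchers):

```python
def route_alert(labels, routes, default_receiver):
    """Return the receivers an alert is delivered to, mimicking
    Alertmanager's first-match routing with the 'continue' flag."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break               # first match wins unless continue: true
    return receivers or [default_receiver]

routes = [
    {"match": {"severity": "critical"}, "receiver": "critical", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "warning"},
    {"match": {"service": "database"}, "receiver": "database-team"},
]

# A critical database alert reaches both teams thanks to continue: true
print(route_alert({"severity": "critical", "service": "database"}, routes, "default"))
# → ['critical', 'database-team']
```

This is why the critical route in the config sets continue: true: without it, a critical database alert would stop at the 'critical' receiver and never reach 'database-team'.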

Slack Notification Template

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    channel: '#alerts'
    username: 'Prometheus Alert'
    icon_emoji: ':alert:'
    title: '{{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      *Alert:* {{ .Labels.alertname }}
      *Severity:* {{ .Labels.severity }}
      *Summary:* {{ .Annotations.summary }}
      *Description:* {{ .Annotations.description }}
      *Pod:* {{ .Labels.pod }}
      *Namespace:* {{ .Labels.namespace }}
      {{ end }}

Panel Types

1. Graph Panel (time series; this is the legacy type — recent Grafana versions use "timeseries" instead)

{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "rate(container_cpu_usage_seconds_total[5m])"
    }
  ]
}

2. Stat Panel (single value)

{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "colorMode": "background",
    "graphMode": "area"
  }
}

3. Table Panel

{
  "type": "table",
  "title": "Top Pods by Memory",
  "targets": [
    {
      "expr": "topk(10, container_memory_usage_bytes)",
      "format": "table"
    }
  ]
}

4. Heatmap

{
  "type": "heatmap",
  "title": "Latency Distribution",
  "targets": [
    {
      "expr": "rate(http_request_duration_seconds_bucket[5m])"
    }
  ]
}

Variables

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": false
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
        "multi": true
      }
    ]
  }
}

Using the variables in a query:

rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod"}[5m])
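When a multi-value variable like $pod is used with =~, Grafana joins the selected values into a regex alternation before sending the query to Prometheus. A rough Python sketch of that interpolation step (Grafana also regex-escapes the values, which is omitted here):

```python
def interpolate(query, variables):
    """Substitute dashboard variables into a PromQL query string.
    Multi-value variables are joined with '|' so they work with =~,
    roughly as Grafana does (value escaping omitted)."""
    for name, value in variables.items():
        if isinstance(value, list):
            value = "|".join(value)  # ["myapp-a", "myapp-b"] -> "myapp-a|myapp-b"
        query = query.replace("$" + name, value)
    return query

q = 'rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod"}[5m])'
print(interpolate(q, {"namespace": "prod", "pod": ["myapp-a", "myapp-b"]}))
# → rate(container_cpu_usage_seconds_total{namespace="prod", pod=~"myapp-a|myapp-b"}[5m])
```

This is also why multi-value variables must be matched with =~ rather than =: the substituted value is a regex, not a single label value.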

Dashboard Best Practices

1. The RED Method

For monitoring services:

  • Rate: request rate (requests per second)
  • Errors: error rate
  • Duration: response time
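The three RED signals can be computed from raw request records before being wired into PromQL. A small self-contained illustration (the record format and sample data are made up for the example, and the P95 uses a rough nearest-rank index rather than interpolation):

```python
# Each record: (duration_seconds, http_status) for one request,
# collected over a 60-second window (hypothetical sample data).
requests = [(0.05, 200), (0.20, 200), (0.10, 500), (0.30, 200), (0.08, 200)]
window_seconds = 60

rate = len(requests) / window_seconds               # Rate: requests per second
errors = sum(1 for _, s in requests if s >= 500)
error_rate = errors / len(requests)                 # Errors: fraction of 5xx responses
durations = sorted(d for d, _ in requests)
p95 = durations[min(len(durations) - 1, int(0.95 * len(durations)))]  # Duration: P95 latency

print(f"rate={rate:.3f} req/s, error_rate={error_rate:.0%}, p95={p95}s")
```

The PromQL equivalents are exactly the queries used in the custom dashboard panels above: rate() over a counter, a ratio of 5xx rate to total rate, and histogram_quantile over bucketed durations.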

2. The USE Method

For monitoring resources:

  • Utilization: how busy the resource is (CPU, memory usage)
  • Saturation: work the resource cannot yet serve (queue length)
  • Errors: error events (failure count)

3. Layered Design

Level 1: Overview dashboard
  - Overall health status
  - Summary of key metrics

Level 2: Service dashboard
  - Detailed metrics for a single service
  - Status of dependent services

Level 3: Instance dashboard
  - Pod-level details
  - Container resource usage

Testing Alert Notifications

# Validate the Alertmanager configuration
amtool check-config alertmanager.yaml

# Send a test alert (the v1 alerts API has been removed in recent
# Alertmanager releases; v2 is used here)
curl -X POST http://alertmanager:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }
]'

Summary

This section covered Grafana visualization:

Dashboard import: prebuilt dashboards and custom configuration
Panel types: Graph, Stat, Table, Heatmap
Alertmanager: alert routing and notification configuration
Best practices: the RED/USE methods and layered design
Variables: dynamic dashboards

Next section: the EFK logging stack.