Grafana Visualization
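Grafana is an open-source visualization and analytics platform. This section uses it to display the monitoring data collected by Prometheus.
Accessing Grafana
# Port-forward the Grafana service (the name prometheus-grafana assumes a Helm release called "prometheus")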
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open http://localhost:3000
# Default username: admin
# Default password: prom-operator (or the password set at install time)
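Importing prebuilt dashboards
The Grafana community publishes a large number of prebuilt dashboards that can be imported directly.
Common dashboard IDs
# In Grafana: Dashboards → Import → enter the ID
# Kubernetes cluster monitoring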
15757 # Kubernetes / Views / Global
15758 # Kubernetes / Views / Namespaces
15759 # Kubernetes / Views / Nodes
15760 # Kubernetes / Views / Pods
# Node Exporter (node monitoring)
1860 # Node Exporter Full
11074 # Node Exporter for Prometheus Dashboard
# Kubernetes resource monitoring
6417 # Kubernetes Deployment Statefulset Daemonset metrics
7249 # Kubernetes Cluster
8588 # Kubernetes Pod Resources
# Nginx Ingress
9614 # NGINX Ingress controller
11274 # NGINX Ingress Controller (Community)
Importing a dashboard manually
# 1. Go to https://grafana.com/grafana/dashboards/
# 2. Search for the dashboard you need
# 3. Copy its JSON or ID
# 4. Import it in Grafana
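The manual steps above can also be scripted against Grafana's HTTP API. The sketch below only builds the request body for POST /api/dashboards/db; the helper name build_import_payload and the sample dashboard are illustrative assumptions, and sending the request (with an API token) is left out.

```python
import json
from typing import Any


def build_import_payload(dashboard: dict[str, Any], overwrite: bool = True) -> dict[str, Any]:
    """Wrap a downloaded dashboard model for POST /api/dashboards/db."""
    model = dict(dashboard)  # shallow copy so the caller's dict is left untouched
    # A null id asks Grafana to create a new dashboard rather than update one by id.
    model["id"] = None
    return {"dashboard": model, "overwrite": overwrite}


if __name__ == "__main__":
    # Hypothetical dashboard JSON as it might be copied from grafana.com.
    downloaded = {"id": 42, "title": "Node Exporter Full", "panels": []}
    print(json.dumps(build_import_payload(downloaded), indent=2))
```

Resetting the id matters because the numeric id from another Grafana instance has no meaning in yours.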
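Creating a custom dashboard
Dashboard configuration
The JSON below is wrapped in a top-level "dashboard" key, which is the payload shape of Grafana's HTTP API (POST /api/dashboards/db). For UI import or file-based provisioning, use only the inner object. The "graph" panel type and the panel-level "alert" block belong to older Grafana releases; recent versions use the "timeseries" panel and define alert rules separately.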
{
  "dashboard": {
    "title": "MyApp Monitoring",
    "tags": ["kubernetes", "myapp"],
    "timezone": "browser",
    "schemaVersion": 27,
    "panels": [
      {
        "id": 1,
        "title": "Request Rate (QPS)",
        "type": "graph",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"myapp\"}[5m])) by (pod)",
            "legendFormat": "{{pod}}",
            "refId": "A"
          }
        ],
        "yaxes": [
          { "label": "requests/sec", "format": "short" }
        ]
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"myapp\",status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"myapp\"}[5m]))",
            "legendFormat": "Error Rate",
            "refId": "A"
          }
        ],
        "yaxes": [
          { "label": "error rate", "format": "percentunit" }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "params": [0.05], "type": "gt" },
              "operator": { "type": "and" },
              "query": { "params": ["A", "5m", "now"] },
              "reducer": { "params": [], "type": "avg" },
              "type": "query"
            }
          ],
          "frequency": "1m",
          "handler": 1,
          "name": "High Error Rate Alert",
          "noDataState": "no_data",
          "notifications": []
        }
      },
      {
        "id": 3,
        "title": "Response Time (P95 / P99)",
        "type": "graph",
        "gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m])) by (le, pod))",
            "legendFormat": "{{pod}} - P95",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m])) by (le, pod))",
            "legendFormat": "{{pod}} - P99",
            "refId": "B"
          }
        ],
        "yaxes": [
          { "label": "seconds", "format": "s" }
        ]
      },
      {
        "id": 4,
        "title": "Pod CPU Usage",
        "type": "graph",
        "gridPos": { "x": 12, "y": 8, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(container_cpu_usage_seconds_total{pod=~\"myapp-.*\",container!=\"\"}[5m])) by (pod)",
            "legendFormat": "{{pod}}",
            "refId": "A"
          }
        ],
        "yaxes": [
          { "label": "cores", "format": "short" }
        ]
      },
      {
        "id": 5,
        "title": "Pod Memory Usage",
        "type": "graph",
        "gridPos": { "x": 0, "y": 16, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(container_memory_usage_bytes{pod=~\"myapp-.*\",container!=\"\"}) by (pod)",
            "legendFormat": "{{pod}}",
            "refId": "A"
          }
        ],
        "yaxes": [
          { "label": "bytes", "format": "bytes" }
        ]
      }
    ]
  }
}
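Writing gridPos by hand gets tedious as the panel count grows. As a minimal sketch, the helpers below generate panels and place them on Grafana's 24-column grid from left to right, wrapping to a new row when a panel no longer fits. The function names and the temporary "_size" key are illustrative assumptions, not part of Grafana's schema.

```python
import json

GRID_COLUMNS = 24  # Grafana lays panels out on a 24-column grid


def make_panel(panel_id, title, expr, legend, width=12, height=8):
    """Return a panel dict without gridPos; layout() assigns the position."""
    return {
        "id": panel_id,
        "title": title,
        "type": "graph",
        "targets": [{"expr": expr, "legendFormat": legend, "refId": "A"}],
        "_size": (width, height),  # temporary helper key, removed by layout()
    }


def layout(panels):
    """Place panels left to right, starting a new row when the grid is full."""
    x = y = row_height = 0
    placed = []
    for panel in panels:
        panel = dict(panel)
        w, h = panel.pop("_size")
        if x + w > GRID_COLUMNS:
            x, y, row_height = 0, y + row_height, 0
        panel["gridPos"] = {"x": x, "y": y, "w": w, "h": h}
        x += w
        row_height = max(row_height, h)
        placed.append(panel)
    return placed


if __name__ == "__main__":
    panels = layout([
        make_panel(1, "Request Rate (QPS)",
                   'sum(rate(http_requests_total{job="myapp"}[5m])) by (pod)', "{{pod}}"),
        make_panel(2, "Pod CPU Usage",
                   'sum(rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])) by (pod)',
                   "{{pod}}"),
    ])
    print(json.dumps({"title": "MyApp Monitoring", "panels": panels}, indent=2))
```

With three half-width panels, this reproduces the positions used in the JSON above: two panels on the first row and the third at x=0, y=8.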
Deploying a dashboard with a ConfigMap
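apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label the Grafana dashboard sidecar watches for
data:
  # The sidecar loads the bare dashboard model, without the API's "dashboard" wrapper
  myapp-dashboard.json: |
    {
      "title": "MyApp Monitoring",
      "panels": [...]
    }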
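Embedding JSON inside YAML by hand is error-prone, so one option is to generate the manifest. The sketch below builds the same ConfigMap as a Python dict and prints it as JSON, which kubectl also accepts. The function name dashboard_configmap is an assumption for illustration; the label defaults mirror the manifest above.

```python
import json


def dashboard_configmap(name, namespace, dashboard,
                        label="grafana_dashboard", label_value="1"):
    """Build a ConfigMap manifest that carries one dashboard as a JSON file."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {label: label_value},
        },
        # The dashboard is stored as a string under "<name>.json".
        "data": {f"{name}.json": json.dumps(dashboard, indent=2)},
    }


if __name__ == "__main__":
    dashboard = {"title": "MyApp Monitoring", "panels": []}
    manifest = dashboard_configmap("myapp-dashboard", "monitoring", dashboard)
    print(json.dumps(manifest, indent=2))
```

The printed manifest can be piped to kubectl apply -f - in place of a hand-written YAML file.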
Alertmanager integration
Configuring alert notifications
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
      # SMTP defaults shared by every email_configs entry below
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'

    # Routing
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s       # wait before sending the first notification for a new group
      group_interval: 10s   # wait before notifying about new alerts in an existing group
      repeat_interval: 12h  # wait before re-sending a notification for alerts still firing
      receiver: 'default'
      routes:
        # Critical alerts
        - match:
            severity: critical
          receiver: 'critical'
          continue: true    # keep evaluating the sibling routes below
          group_wait: 0s
        # Warning alerts
        - match:
            severity: warning
          receiver: 'warning'
          group_wait: 30s
        # Alerts for a specific service
        - match:
            service: database
          receiver: 'database-team'

    # Receivers
    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: '🔥 Critical: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_KEY'
        email_configs:
          - to: 'ops@example.com'
      - name: 'warning'
        slack_configs:
          - channel: '#alerts-warning'
            title: '⚠️ Warning: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'database-team'
        slack_configs:
          - channel: '#db-alerts'
        email_configs:
          - to: 'db-team@example.com'

    # Inhibition rules
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'cluster', 'service']
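To reason about which receivers an alert reaches, it can help to simulate the routing rules. The sketch below is a simplified model of the three child routes above, not Alertmanager's implementation: it covers only exact label matching on one level of routes, and the Route class and receivers_for function are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)
    continue_: bool = False  # mirrors the `continue` key in the YAML


def matches(route, labels):
    """A route matches when every label in `match` equals the alert's label."""
    return all(labels.get(key) == value for key, value in route.match.items())


def receivers_for(labels, child_routes, default_receiver="default"):
    """Walk child routes in order; stop at the first match unless it continues."""
    selected = []
    for route in child_routes:
        if matches(route, labels):
            selected.append(route.receiver)
            if not route.continue_:
                break
    # No child matched: the alert is handled by the root route's receiver.
    return selected or [default_receiver]


ROUTES = [
    Route("critical", {"severity": "critical"}, continue_=True),
    Route("warning", {"severity": "warning"}),
    Route("database-team", {"service": "database"}),
]

if __name__ == "__main__":
    for labels in (
        {"severity": "critical", "service": "database"},
        {"severity": "warning", "service": "database"},
        {"severity": "info"},
    ):
        print(labels, "->", receivers_for(labels, ROUTES))
```

Under this model, a critical database alert reaches both critical and database-team, while a warning database alert stops at warning because that route does not set continue.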
Slack notification template
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        username: 'Prometheus Alert'
        icon_emoji: ':alert:'
        title: '{{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Severity:* {{ .Labels.severity }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Pod:* {{ .Labels.pod }}
          *Namespace:* {{ .Labels.namespace }}
          {{ end }}
Panel types
1. Graph panel (time series)
{
  "type": "graph",
  "title": "CPU Usage",
  "targets": [
    {
      "expr": "rate(container_cpu_usage_seconds_total[5m])"
    }
  ]
}
2. Stat panel (single value)
{
  "type": "stat",
  "title": "Total Requests",
  "targets": [
    {
      "expr": "sum(http_requests_total)"
    }
  ],
  "options": {
    "colorMode": "background",
    "graphMode": "area"
  }
}
3. Table panel
{
  "type": "table",
  "title": "Top Pods by Memory",
  "targets": [
    {
      "expr": "topk(10, container_memory_usage_bytes)",
      "format": "table",
      "instant": true
    }
  ]
}
4. Heatmap
{
  "type": "heatmap",
  "title": "Latency Distribution",
  "targets": [
    {
      "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le)",
      "format": "heatmap",
      "legendFormat": "{{le}}"
    }
  ]
}
Variables
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)",
        "multi": false
      },
      {
        "name": "pod",
        "type": "query",
        "query": "label_values(kube_pod_info{namespace=\"$namespace\"}, pod)",
        "multi": true
      }
    ]
  }
}
Using the variables:
rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$pod"}[5m])
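The query above uses = for the single-value $namespace and =~ for the multi-value $pod because a multi-value selection expands to a regex alternation. The sketch below is a simplified imitation of that substitution, not Grafana's actual code: it skips the escaping of special characters that Grafana performs, and the interpolate function is an illustrative name.

```python
import re


def interpolate(query, variables):
    """Replace $name placeholders; lists become a (a|b) regex alternation."""

    def render(match):
        name = match.group(1)
        if name not in variables:
            return match.group(0)  # leave unknown variables untouched
        value = variables[name]
        if isinstance(value, list):
            return "(" + "|".join(value) + ")"
        return value

    return re.sub(r"\$(\w+)", render, query)


if __name__ == "__main__":
    query = ('rate(container_cpu_usage_seconds_total'
             '{namespace="$namespace", pod=~"$pod"}[5m])')
    print(interpolate(query, {"namespace": "prod", "pod": ["myapp-1", "myapp-2"]}))
```

An alternation such as (myapp-1|myapp-2) only works with a regex matcher, which is why pod uses =~ rather than =.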
Dashboard best practices
1. The RED method
For monitoring services:
- Rate: requests per second
- Errors: error rate
- Duration: response time
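To make the three RED signals concrete, the sketch below computes them from raw numbers in plain Python. It is a simplified illustration, assuming counters that never reset and cumulative histogram buckets given as (upper bound, count) pairs; the quantile estimate interpolates linearly inside a bucket, in the spirit of PromQL's histogram_quantile. All sample values are made up.

```python
import math


def rate(counter_start, counter_end, seconds):
    """Per-second increase of a counter over a window (no reset handling)."""
    return (counter_end - counter_start) / seconds


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                return prev_bound  # cannot interpolate here
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    window = 300  # seconds, matching the [5m] range used in the panels
    request_rate = rate(1000, 1600, window)           # Rate
    error_ratio = rate(10, 40, window) / request_rate  # Errors
    buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    p95 = histogram_quantile(0.95, buckets)            # Duration
    print(request_rate, error_ratio, p95)
```

The estimate is only as precise as the bucket boundaries: every value inside a bucket is assumed to be evenly spread.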
2. The USE method
For monitoring resources:
- Utilization: how busy the resource is (CPU and memory usage)
- Saturation: queued work (queue length)
- Errors: error events (failure count)
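As a small worked example of the first two USE signals, the sketch below derives CPU utilization from a CPU-seconds counter and saturation from sampled queue lengths. The function names, the 2-core limit, and the sample values are assumptions for illustration only.

```python
def cpu_utilization(cpu_seconds_start, cpu_seconds_end, window_seconds, cpu_limit_cores):
    """Fraction of the CPU limit consumed over the window."""
    used_cores = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    return used_cores / cpu_limit_cores


def saturation(queue_length_samples):
    """Average queue length; a sustained value above zero suggests saturation."""
    return sum(queue_length_samples) / len(queue_length_samples)


if __name__ == "__main__":
    # 150 CPU-seconds used in 300 s is 0.5 cores, a quarter of a 2-core limit.
    print(cpu_utilization(100, 250, 300, 2))
    print(saturation([0, 2, 4]))
```

Utilization near 1.0 with a growing queue is the usual sign that a resource, rather than the application code, is the bottleneck.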
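3. Layered design
Level 1: Overview dashboard
- Overall health
- Summary of key metrics
Level 2: Service dashboard
- Detailed metrics for a single service
- Status of dependent services
Level 3: Instance dashboard
- Pod-level details
- Container resource usage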
Testing alert notifications
# Validate the Alertmanager configuration
amtool check-config alertmanager.yaml
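# Send a test alert (API v2; the v1 endpoint was removed in Alertmanager 0.27)
curl -X POST -H 'Content-Type: application/json' http://alertmanager:9093/api/v2/alerts -d '[
  {
    "labels": {
      "alertname": "TestAlert",
      "severity": "warning"
    },
    "annotations": {
      "summary": "This is a test alert"
    }
  }
]'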
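The same test alert can be sent from a script, which is convenient in CI checks. The sketch below is a minimal version using only the standard library; it assumes the Alertmanager v2 API at the alertmanager:9093 address used above, and the explicit startsAt and endsAt fields are an optional addition so that the test alert resolves on its own after a few minutes.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(name="TestAlert", severity="warning", minutes=5):
    """Build a one-alert payload that expires after `minutes`."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": "This is a test alert"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send(alerts, base_url="http://alertmanager:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    print(send(build_test_alert()))
```

With the routing configuration shown earlier, this alert carries severity warning and should therefore arrive at the warning receiver.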
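Summary
This section covered Grafana visualization:
✅ Dashboard import: prebuilt dashboards and custom configuration
✅ Panel types: Graph, Stat, Table, Heatmap
✅ Alertmanager: alert routing and notification configuration
✅ Best practices: the RED/USE methods and layered design
✅ Variables: dynamic dashboards
Next section: the EFK logging stack.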