Monitoring and Logging
The Three Pillars of Observability
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│   Metrics    │   │     Logs     │   │    Traces    │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┴──────────────────┘
                          │
              ┌───────────▼───────────┐
              │     Observability     │
              │       Platform        │
              └───────────────────────┘
Core goals:
- Metrics: understand the system's state
- Logs: understand what happened
- Traces: understand the flow of a request
Prometheus - Metrics Monitoring
Architecture Overview
┌─────────────┐  Pull Metrics  ┌──────────────┐
│ Application │◄───────────────│  Prometheus  │
│  /metrics   │                │    Server    │
└─────────────┘                └──────┬───────┘
                                      │
                ┌─────────────────────┼──────────────┐
                │                     │              │
       ┌────────▼────────┐    ┌───────▼──────┐  ┌────▼──────┐
       │     Grafana     │    │ Alertmanager │  │   Time    │
       │ (Visualization) │    │  (Alerting)  │  │ Series DB │
       └─────────────────┘    └──────────────┘  └───────────┘
Installing the Prometheus Stack
# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes Prometheus, Grafana, and Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
--set grafana.adminPassword=admin123
# Check deployment status
kubectl get pods -n monitoring
Accessing Prometheus and Grafana
# Port-forward Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open http://localhost:9090
# Port-forward Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open http://localhost:3000 (admin / admin123)
Exposing Metrics from Your Application
Node.js Example
// app.js
const express = require('express');
const promClient = require('prom-client');
const app = express();
// Register and collect the default process-level metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.5, 1, 2, 5]
});
register.registerMetric(httpRequestDuration);
const httpRequestTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
register.registerMetric(httpRequestTotal);
// Middleware that records metrics for every request
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
httpRequestDuration.labels(req.method, req.route?.path || req.path, res.statusCode).observe(duration);
httpRequestTotal.labels(req.method, req.route?.path || req.path, res.statusCode).inc();
});
next();
});
// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
app.listen(3000);
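To make the bucket layout above concrete, here is a small sketch (not part of the app code, just an illustration) of how Prometheus-style cumulative histogram buckets accumulate observations: each bucket counts every observation less than or equal to its upper bound.

```javascript
// Cumulative histogram buckets, Prometheus-style: each bucket counts
// all observations <= its upper bound; +Inf counts everything.
function bucketCounts(bounds, observations) {
  const counts = bounds.map(() => 0);
  let inf = 0;
  for (const v of observations) {
    bounds.forEach((b, i) => {
      if (v <= b) counts[i]++;
    });
    inf++; // every observation also lands in the +Inf bucket
  }
  return { counts, inf };
}
```

With the buckets from the example above, a single 0.3s observation increments the 0.5, 1, 2, 5, and +Inf buckets at once; this cumulative shape is what histogram_quantile relies on later.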
ServiceMonitor Configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: myapp-monitor
namespace: production
labels:
release: prometheus # must match the Prometheus serviceMonitorSelector
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: http
path: /metrics
interval: 30s
scrapeTimeout: 10s
relabelings:
- sourceLabels: [__meta_kubernetes_pod_name]
targetLabel: pod
- sourceLabels: [__meta_kubernetes_namespace]
targetLabel: namespace
PodMonitor Configuration
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: myapp-pod-monitor
namespace: production
spec:
selector:
matchLabels:
app: myapp
podMetricsEndpoints:
- port: metrics
interval: 30s
path: /metrics
PrometheusRule - Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: myapp-alerts
namespace: monitoring
spec:
groups:
- name: myapp.rules
interval: 30s
rules:
# Pod-down alert
- alert: PodDown
expr: up{job="myapp"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} is down"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been down for more than 5 minutes."
# High error-rate alert
- alert: HighErrorRate
expr: |
rate(http_requests_total{status_code=~"5.."}[5m])
/
rate(http_requests_total[5m]) > 0.05
for: 10m
labels:
severity: warning
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
# High-latency alert
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
# High memory usage alert
- alert: HighMemoryUsage
expr: |
(container_memory_usage_bytes{pod=~"myapp-.*"}
/
container_spec_memory_limit_bytes{pod=~"myapp-.*"}) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Pod {{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
# High CPU usage alert
- alert: HighCPUUsage
expr: |
rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m]) > 0.8
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage"
description: "Pod {{ $labels.pod }} CPU usage is {{ $value | humanizePercentage }}"
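The alert expressions above lean on PromQL's rate(). As a rough illustration (a simplification: real rate() also extrapolates toward the window edges), here is how a per-second rate is derived from raw counter samples, including the counter-reset handling that makes rate() safe across pod restarts.

```javascript
// Per-second rate over [timestampSeconds, value] counter samples,
// handling counter resets (value drops) the way PromQL's rate() does.
function counterRate(samples) {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i][1] - samples[i - 1][1];
    // On a reset the counter restarted from 0, so the new value
    // itself is the increase since the reset.
    increase += delta >= 0 ? delta : samples[i][1];
  }
  const span = samples[samples.length - 1][0] - samples[0][0];
  return increase / span;
}
```

This is why alert rules divide two rate() results rather than raw counters: raw counter values grow forever and drop to zero on restart, while rates stay comparable.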
Alertmanager Configuration
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-config
namespace: monitoring
stringData:
alertmanager.yaml: |
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical'
continue: true
- match:
severity: warning
receiver: 'warning'
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'Alert'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'critical'
slack_configs:
- channel: '#alerts-critical'
title: '🔥 Critical Alert'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
- name: 'warning'
slack_configs:
- channel: '#alerts-warning'
title: '⚠️ Warning Alert'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
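The group_by setting above bundles alerts sharing the same label values into a single notification, which is why one bad deployment produces one Slack message instead of fifty. A minimal sketch of that grouping logic:

```javascript
// Group alerts by the values of the labels listed in groupBy,
// as Alertmanager's route-level group_by does.
function groupAlerts(alerts, groupBy) {
  const groups = new Map();
  for (const alert of alerts) {
    const key = groupBy.map((l) => `${l}=${alert.labels[l] ?? ''}`).join(',');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(alert);
  }
  return groups;
}
```

Each group is then subject to group_wait (delay before the first notification) and group_interval (delay before notifying about new alerts in an existing group).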
Grafana - Data Visualization
Importing Prebuilt Dashboards
# Log in to Grafana
# Go to Dashboards → Import
# Enter one of the following dashboard IDs:
# Kubernetes cluster monitoring
15760 # Kubernetes / Views / Global
15759 # Kubernetes / Views / Namespaces
15758 # Kubernetes / Views / Pods
# Node Exporter
1860 # Node Exporter Full
# Application Metrics
6417 # Kubernetes Deployment Statefulset Daemonset metrics
Custom Dashboard JSON
{
"dashboard": {
"title": "MyApp Metrics",
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total{job=\"myapp\"}[5m])",
"legendFormat": "{{pod}}"
}
]
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total{job=\"myapp\",status_code=~\"5..\"}[5m])",
"legendFormat": "{{pod}} - {{status_code}}"
}
]
},
{
"id": 3,
"title": "Latency (P95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{job=\"myapp\"}[5m]))",
"legendFormat": "{{pod}}"
}
]
}
]
}
}
Deploying a Dashboard via ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
name: myapp-dashboard
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
myapp-dashboard.json: |
{
"dashboard": {
"title": "MyApp Dashboard",
"panels": [...]
}
}
Log Collection - EFK Stack
Architecture
┌─────────────┐
│ Application │
│  (stdout)   │
└──────┬──────┘
       │
┌──────▼──────┐
│   Fluentd   │ ◄── collects logs
│  DaemonSet  │
└──────┬──────┘
       │
┌──────▼──────────┐
│  Elasticsearch  │ ◄── stores logs
│     Cluster     │
└──────┬──────────┘
       │
┌──────▼──────┐
│   Kibana    │ ◄── search and visualization
└─────────────┘
Installing Elasticsearch
apiVersion: v1
kind: Service
metadata:
name: elasticsearch
namespace: logging
spec:
clusterIP: None
ports:
- port: 9200
name: http
- port: 9300
name: transport
selector:
app: elasticsearch
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: elasticsearch
namespace: logging
spec:
serviceName: elasticsearch
replicas: 3
selector:
matchLabels:
app: elasticsearch
template:
metadata:
labels:
app: elasticsearch
spec:
initContainers:
- name: increase-vm-max-map
image: busybox
command: ["sysctl", "-w", "vm.max_map_count=262144"]
securityContext:
privileged: true
containers:
- name: elasticsearch
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
env:
- name: cluster.name
value: "k8s-logs"
- name: node.name
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: discovery.seed_hosts
value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
- name: cluster.initial_master_nodes
value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
- name: ES_JAVA_OPTS
value: "-Xms2g -Xmx2g"
- name: xpack.security.enabled
value: "false"
ports:
- containerPort: 9200
name: http
- containerPort: 9300
name: transport
volumeMounts:
- name: data
mountPath: /usr/share/elasticsearch/data
resources:
requests:
memory: 2Gi
cpu: 1
limits:
memory: 4Gi
cpu: 2
volumeClaimTemplates:
- metadata:
name: data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
Installing Fluentd
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: logging
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
</filter>
<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix k8s
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.system.buffer
flush_mode interval
retry_type exponential_backoff
flush_interval 5s
retry_forever false
retry_max_interval 30
chunk_limit_size 2M
queue_limit_length 8
overflow_action block
</buffer>
</match>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd
namespace: logging
spec:
selector:
matchLabels:
app: fluentd
template:
metadata:
labels:
app: fluentd
spec:
serviceAccountName: fluentd
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "elasticsearch.logging.svc.cluster.local"
- name: FLUENT_ELASTICSEARCH_PORT
value: "9200"
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: config
mountPath: /fluentd/etc/fluent.conf
subPath: fluent.conf
resources:
requests:
cpu: 100m
memory: 200Mi
limits:
cpu: 500m
memory: 500Mi
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: config
configMap:
name: fluentd-config
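The buffer settings in the ConfigMap above (retry_type exponential_backoff with retry_max_interval 30) make Fluentd wait exponentially longer between failed flushes to Elasticsearch, capped at 30 seconds. The retry schedule looks roughly like this sketch (Fluentd also applies randomized jitter, omitted here):

```javascript
// Exponential backoff capped at maxInterval, as configured in the
// Fluentd buffer section (retry_type exponential_backoff).
function backoffDelays(baseSeconds, maxInterval, attempts) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    // double the delay each attempt, but never exceed the cap
    delays.push(Math.min(baseSeconds * 2 ** i, maxInterval));
  }
  return delays;
}
```

Combined with overflow_action block, this means a prolonged Elasticsearch outage slows log ingestion down rather than dropping logs outright.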
Installing Kibana
apiVersion: apps/v1
kind: Deployment
metadata:
name: kibana
namespace: logging
spec:
replicas: 1
selector:
matchLabels:
app: kibana
template:
metadata:
labels:
app: kibana
spec:
containers:
- name: kibana
image: docker.elastic.co/kibana/kibana:8.11.0
env:
- name: ELASTICSEARCH_HOSTS
value: "http://elasticsearch:9200"
ports:
- containerPort: 5601
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1
memory: 2Gi
---
apiVersion: v1
kind: Service
metadata:
name: kibana
namespace: logging
spec:
type: NodePort
ports:
- port: 5601
targetPort: 5601
selector:
app: kibana
Loki - A Lightweight Logging Alternative
Installing the Loki Stack
# Install Loki + Promtail + Grafana with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace logging \
--create-namespace \
--set grafana.enabled=true \
--set prometheus.enabled=true \
--set promtail.enabled=true
Promtail Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: promtail-config
namespace: logging
data:
promtail.yaml: |
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__meta_kubernetes_pod_container_name]
target_label: container
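The relabel_configs above copy Kubernetes service-discovery metadata onto the log stream's labels. A simplified sketch of the source_labels to target_label mapping (ignoring regex, action, and the other relabel features):

```javascript
// Copy discovered __meta_* labels onto target labels; multiple
// source labels are joined with ';' (Prometheus's default separator).
function relabel(discovered, rules) {
  const out = {};
  for (const { sourceLabels, targetLabel } of rules) {
    out[targetLabel] = sourceLabels.map((l) => discovered[l] ?? '').join(';');
  }
  return out;
}
```

After relabeling, queries in Grafana can filter by app, namespace, pod, or container instead of raw discovery metadata.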
Distributed Tracing - Jaeger
Installing the Jaeger Operator
kubectl create namespace tracing
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/crds/jaegertracing.io_jaegers_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/service_account.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role_binding.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/operator.yaml
Jaeger Instance
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: jaeger
namespace: tracing
spec:
strategy: production
storage:
type: elasticsearch
options:
es:
server-urls: http://elasticsearch.logging:9200
Integrating OpenTelemetry in Your Application
// Node.js example
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const provider = new NodeTracerProvider();
const exporter = new JaegerExporter({
endpoint: 'http://jaeger-collector:14268/api/traces',
});
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
registerInstrumentations({
instrumentations: [
new HttpInstrumentation(),
new ExpressInstrumentation(),
],
});
Common PromQL Queries
# CPU usage
rate(container_cpu_usage_seconds_total{pod=~"myapp-.*"}[5m])
# Memory usage ratio
container_memory_usage_bytes{pod=~"myapp-.*"} / container_spec_memory_limit_bytes{pod=~"myapp-.*"}
# HTTP request rate (QPS)
rate(http_requests_total[5m])
# HTTP error rate
rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m])
# HTTP latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Pod restarts in the last hour
increase(kube_pod_container_status_restarts_total[1h])
# Available pods per deployment
kube_deployment_status_replicas_available
# Disk usage ratio
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes
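To demystify the P95 query above: histogram_quantile finds the cumulative bucket where the target rank falls and linearly interpolates inside it. A sketch of that calculation (simplified: it assumes a single, already-sorted series of cumulative buckets):

```javascript
// Linear-interpolation quantile over cumulative buckets of the form
// [upperBound, cumulativeCount]; the last entry is [Infinity, total].
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1][1];
  const rank = q * total;
  let prevBound = 0;
  let prevCount = 0;
  for (const [bound, count] of buckets) {
    if (count >= rank) {
      // In the +Inf bucket, PromQL returns the highest finite bound.
      if (bound === Infinity) return prevBound;
      return prevBound + ((bound - prevBound) * (rank - prevCount)) / (count - prevCount);
    }
    prevBound = bound;
    prevCount = count;
  }
  return prevBound;
}
```

The interpolation is why bucket boundaries matter: a P95 that falls between 0.5s and 1s can only be estimated, so buckets should bracket your SLO targets closely.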
Best Practices
1. Logging Conventions
// Structured log entry
{
"timestamp": "2024-01-08T12:00:00Z",
"level": "error",
"message": "Failed to process request",
"service": "myapp",
"trace_id": "abc123",
"span_id": "def456",
"user_id": "user123",
"request_id": "req789",
"error": {
"type": "DatabaseError",
"message": "Connection timeout"
}
}
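A logger producing entries in that shape can be very small. Here is a sketch (the field names are simply the ones from the example above, not a standard):

```javascript
// Emit one JSON object per log line so Fluentd/Loki can index
// fields instead of grepping free-form text.
function makeLogger(service, baseFields = {}) {
  return function log(level, message, fields = {}) {
    return JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      message,
      service,
      ...baseFields, // fields shared by every entry (e.g. region)
      ...fields,     // per-entry fields (e.g. trace_id, request_id)
    });
  };
}
```

Usage: const log = makeLogger('myapp'); then print log('error', 'Failed to process request', { trace_id: 'abc123' }) to stdout, where the DaemonSet collector picks it up.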
2. Metric Naming
# Follow the Prometheus naming conventions
http_requests_total # Counter
http_request_duration_seconds # Histogram
memory_usage_bytes # Gauge
queue_size # Gauge
3. Alert Design
- Symptom-based alerts: alert on user impact (latency, error rate)
- Cause-based alerts: alert on root causes (CPU, memory)
- Severity tiers: Critical, Warning, Info
- Noise reduction: tune the for duration and thresholds sensibly
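The noise-reduction point is worth making concrete: a rule's for clause keeps an alert in pending until the expression has been true continuously for the whole duration, so a brief blip never pages anyone. A sketch of that state machine:

```javascript
// inactive → pending → firing, mirroring a PrometheusRule `for` clause.
// Any evaluation where the expression is false resets the pending timer.
function makeForState(forSeconds) {
  let pendingSince = null;
  return function evaluate(exprIsTrue, nowSeconds) {
    if (!exprIsTrue) {
      pendingSince = null;
      return 'inactive';
    }
    if (pendingSince === null) pendingSince = nowSeconds;
    return nowSeconds - pendingSince >= forSeconds ? 'firing' : 'pending';
  };
}
```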
4. Dashboard Design
- RED method: Rate, Errors, Duration
- USE method: Utilization, Saturation, Errors
- Layered design: overview → service → instance
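As an illustration of the RED method, here is how Rate, Errors, and Duration could be computed from raw request records (a sketch with hypothetical record fields; in a real dashboard these come from the PromQL queries shown earlier):

```javascript
// Rate (req/s), Errors (5xx ratio), Duration (P95) over a window.
function red(requests, windowSeconds) {
  const errors = requests.filter((r) => r.status >= 500).length;
  const durations = requests.map((r) => r.duration).sort((a, b) => a - b);
  // nearest-rank P95 over the sorted durations
  const idx = Math.min(durations.length - 1, Math.ceil(0.95 * durations.length) - 1);
  return {
    rate: requests.length / windowSeconds,
    errorRatio: errors / requests.length,
    p95: durations[idx],
  };
}
```

These three numbers map directly onto the three example panels in the dashboard JSON earlier in this chapter.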
5. Data Retention
# Keep Prometheus data for 30 days
prometheus:
retention: 30d
# Elasticsearch index lifecycle policy
PUT _ilm/policy/k8s-logs-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_age": "1d",
"max_size": "50gb"
}
}
},
"delete": {
"min_age": "30d",
"actions": {
"delete": {}
}
}
}
}
}
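The hot-phase rollover above triggers on whichever condition is hit first, age or size. The decision itself is simple (a sketch with hypothetical field names for the index stats):

```javascript
// Roll over the write index when it exceeds max_age OR max_size,
// matching the ILM hot-phase rollover above.
function shouldRollover(index, maxAgeSeconds, maxSizeBytes, nowSeconds) {
  const age = nowSeconds - index.createdSeconds;
  return age >= maxAgeSeconds || index.sizeBytes >= maxSizeBytes;
}
```

Daily rollover plus a 30-day delete phase keeps index sizes predictable and makes retention a matter of dropping whole indices rather than deleting documents.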
Common Commands
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Kibana
kubectl port-forward -n logging svc/kibana 5601:5601
# View logs
kubectl logs -f <pod-name>
kubectl logs <pod-name> --previous # logs from the previous container instance
kubectl logs <pod-name> -c <container-name> # a specific container in a multi-container pod
# Aggregate logs across pods
stern <pod-name> -n <namespace>
kubectl logs -l app=myapp --tail=100
Summary
Observability is the foundation of system reliability:
Monitoring:
- Prometheus: time-series metrics storage
- Grafana: visualization
- Alertmanager: alert management
Logging:
- EFK: Elasticsearch + Fluentd + Kibana (feature-rich)
- Loki: lightweight alternative (lower cost)
Tracing:
- Jaeger: distributed tracing
- OpenTelemetry: a unified standard
Core principles:
- Structured logs
- Meaningful metrics
- Sensible alerts
- Clear dashboards
In the next chapter we will cover troubleshooting: quickly locating and resolving problems.