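Prometheus Monitoring System

Prometheus is the de facto standard for cloud-native monitoring. It pulls metrics over HTTP from each target's /metrics endpoint, stores them in its time-series database (TSDB), serves queries to Grafana, and sends firing alerts to Alertmanager.

Architecture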

┌─────────────┐     Pull     ┌──────────────┐
│ Application │◄─────────────│  Prometheus  │
│  /metrics   │              │    Server    │
└─────────────┘              └──────┬───────┘
                                    │
            ┌───────────────────────┼───────────────┐
            │                       │               │
   ┌────────▼────────┐      ┌───────▼───────┐  ┌────▼────┐
   │     Grafana     │      │ Alertmanager  │  │  TSDB   │
   │ (visualization) │      │  (alerting)   │  │         │
   └─────────────────┘      └───────────────┘  └─────────┘
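
To make the pull model concrete, here is a minimal sketch of what an application might expose at /metrics, written with only the Python standard library. The counter store `REQUESTS_TOTAL`, the `serve` helper, and port 8080 are assumptions for illustration; a real service would normally use an official Prometheus client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status_code).
REQUESTS_TOTAL = {("GET", "200"): 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status_code), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status_code="{status_code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes this path on every scrape interval.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the endpoint (blocks until interrupted); the port is an assumption."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then requesting `http://localhost:8080/metrics` should return the counter lines that the Prometheus server would scrape.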

Installing kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

ServiceMonitor Configuration

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: production
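  labels:
    release: prometheus  # must match the Helm release name to be selected by default
spec:
  selector:
    matchLabels:
      app: myapp         # labels on the target Service
  endpoints:
  - port: http           # name of the Service port, not a number
    path: /metrics
    interval: 30s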

PrometheusRule Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
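metadata:
  name: myapp-alerts
  namespace: monitoring
  labels:
    release: prometheus  # matches the Helm release name so the chart's default rule selector picks up this rule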
spec:
  groups:
  - name: myapp.rules
    interval: 30s
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status_code=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Error rate is too high"
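
The alert condition compares an error ratio against 0.05: the per-second rate of 5xx requests divided by the per-second rate of all requests. The arithmetic can be sketched in Python as follows. The sample data and the helper names `simple_rate` and `error_ratio` are made up for illustration, and `simple_rate` is a deliberate simplification of PromQL's `rate()`, which also handles counter resets and extrapolates to the window boundaries.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def error_ratio(series_by_status):
    """Sum the 5xx rates and divide by the sum of all rates."""
    rates = {code: simple_rate(s) for code, s in series_by_status.items()}
    errors = sum(r for code, r in rates.items() if code.startswith("5"))
    total = sum(rates.values())
    return errors / total


# Hypothetical counter samples over a 5-minute (300 s) window.
series = {
    "200": [(0, 1000), (300, 1540)],  # 540 successful requests
    "500": [(0, 10), (300, 70)],      # 60 failed requests
}

ratio = error_ratio(series)
fires = ratio > 0.05  # same threshold as the alert expression
```

With these numbers, 60 of 600 requests failed, so the ratio is about 0.1 and the condition holds. The alert would still wait for the `for: 5m` duration before firing.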

Common PromQL Queries

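# CPU usage (CPU cores consumed, per container)
rate(container_cpu_usage_seconds_total[5m])

# Memory usage as a fraction of the limit (a container without a limit reports 0, which yields +Inf)
container_memory_usage_bytes / container_spec_memory_limit_bytes

# HTTP QPS (per series; wrap in sum() for a service-wide total)
rate(http_requests_total[5m])

# P95 latency (per series; aggregate with sum by (le) for a service-wide value)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))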
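
To show how a quantile is estimated from cumulative histogram buckets, here is a small Python sketch of the linear interpolation that `histogram_quantile` performs for classic histograms. It is an illustration under simplifying assumptions, not the Prometheus implementation: the bucket data is invented, and the lowest bucket is assumed to start at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket rates: le="0.1", le="0.5", le="1", le="+Inf"
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
```

Here the 95th observation out of 100 falls in the 0.5 to 1.0 bucket, so the estimate is about 0.81 seconds. The result is only as precise as the bucket boundaries, which is why bucket layout matters for latency alerts.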

Summary

✅ Installing and configuring Prometheus
✅ Collecting metrics with ServiceMonitor
✅ Defining alerting rules with PrometheusRule
✅ Common PromQL queries

Next section: Grafana visualization.