Monitoring and Alerting

A well-built monitoring and alerting system is key to keeping a system running reliably. This chapter covers best practices for monitoring production environments.

Monitoring Layers

┌─────────────────────────────────────────┐
│  Business monitoring                    │
│  ├─ API response time                   │
│  ├─ Business success rate               │
│  └─ User behavior analysis              │
├─────────────────────────────────────────┤
│  Application monitoring                 │
│  ├─ Application performance (APM)       │
│  ├─ Log monitoring                      │
│  └─ Error tracking                      │
├─────────────────────────────────────────┤
│  System monitoring                      │
│  ├─ CPU, memory, disk                   │
│  ├─ Network traffic                     │
│  └─ Process status                      │
├─────────────────────────────────────────┤
│  Infrastructure monitoring              │
│  ├─ Server health                       │
│  ├─ Network devices                     │
│  └─ Storage systems                     │
└─────────────────────────────────────────┘

Prometheus Monitoring

Installation and Configuration

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Configuration file: prometheus.yml
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: 
        - 'localhost:9100'
        - 'server1:9100'
        - 'server2:9100'
EOF

# Start Prometheus
./prometheus --config.file=prometheus.yml
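
Before relying on this setup, it is worth validating the configuration with promtool, which ships in the same tarball, and running Prometheus under systemd rather than in the foreground. The unit below is a minimal sketch; it assumes the binary has been copied to /usr/local/bin/prometheus, the config to /etc/prometheus/prometheus.yml, and that a prometheus system user exists.

# Validate the configuration (promtool is included in the Prometheus tarball)
./promtool check config prometheus.yml

# Minimal systemd unit (assumed paths; adjust to your layout)
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data
Restart=always

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now prometheus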

Node Exporter

# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xvf node_exporter-1.6.0.linux-amd64.tar.gz
cd node_exporter-1.6.0.linux-amd64
cp node_exporter /usr/local/bin/   # the systemd unit below expects the binary here

# Create a systemd service
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude='^/(sys|proc|dev|host|etc)($$|/)' \
  --collector.netclass.ignored-devices='^(veth.*|docker.*|br-.*|lo)$$'
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Start the service
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter

# Verify
curl http://localhost:9100/metrics
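
Once the exporter responds, confirm that Prometheus is actually scraping it. A quick check (assuming Prometheus is running on localhost:9090 as configured above; jq is optional but makes the output readable):

# List active scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'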

Alert Rules

# alerts.yml
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # CPU alert
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on instance {{ $labels.instance }}"
          description: "CPU usage: {{ $value }}%"

      # Memory alert
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on instance {{ $labels.instance }}"
          description: "Memory usage: {{ $value }}%"

      # Disk space alert
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on instance {{ $labels.instance }}"
          description: "Mount point {{ $labels.mountpoint }} has {{ $value }}% space left"

      # Instance down alert
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "The {{ $labels.job }} target is unreachable"

      # Disk I/O alert
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on instance {{ $labels.instance }}"
          description: "Device {{ $labels.device }} I/O utilization: {{ $value }}"

      # Network traffic alert
      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual network traffic on instance {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} receive rate: {{ $value | humanize }}B/s"
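
After editing the rule file, validate it before reloading Prometheus so a syntax error does not silently disable alerting:

# Validate the alert rules (promtool ships with Prometheus)
./promtool check rules alerts.yml

# Reload Prometheus so the new rules take effect (SIGHUP triggers a config reload)
kill -HUP $(pgrep -x prometheus)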

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        headers:
          Subject: '[Prometheus] {{ .GroupLabels.alertname }}'

  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    webhook_configs:
      - url: 'http://webhook.example.com/alerts'

  - name: 'warning'
    email_configs:
      - to: 'ops@example.com'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
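
The Alertmanager configuration can be validated with amtool, which ships with Alertmanager; the commands below assume the Alertmanager tarball has been unpacked into the current directory.

# Validate the configuration
./amtool check-config alertmanager.yml

# Start Alertmanager (listens on :9093 by default, matching the alerting block in prometheus.yml)
./alertmanager --config.file=alertmanager.yml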

Grafana Visualization

Installation and Configuration

# Install Grafana
wget https://dl.grafana.com/oss/release/grafana-10.0.0.linux-amd64.tar.gz
tar -zxvf grafana-10.0.0.linux-amd64.tar.gz
cd grafana-10.0.0

# Configuration: put overrides in conf/custom.ini rather than editing conf/defaults.ini
# Key settings:
# [server]
# http_port = 3000
# domain = localhost

# Start
./bin/grafana-server

# Open http://localhost:3000
# Default username/password: admin/admin

Data Source Configuration

# Add the Prometheus data source (via the UI or the API)
curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H "Content-Type: application/json" \
  -d '{
    "name":"Prometheus",
    "type":"prometheus",
    "url":"http://localhost:9090",
    "access":"proxy",
    "isDefault":true
  }'
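
To confirm the data source was created (still assuming the default admin/admin credentials), list the configured data sources; community dashboards such as the widely used Node Exporter dashboard can then be imported through the UI.

# List configured data sources
curl -s http://admin:admin@localhost:3000/api/datasources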

Common Panels

// CPU usage panel
{
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "title": "CPU Usage"
}

// Memory usage panel
{
  "targets": [
    {
      "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
    }
  ],
  "title": "Memory Usage"
}

// Disk I/O panel
{
  "targets": [
    {
      "expr": "rate(node_disk_read_bytes_total[5m])"
    },
    {
      "expr": "rate(node_disk_written_bytes_total[5m])"
    }
  ],
  "title": "Disk I/O"
}

// Network traffic panel
{
  "targets": [
    {
      "expr": "rate(node_network_receive_bytes_total[5m])"
    },
    {
      "expr": "rate(node_network_transmit_bytes_total[5m])"
    }
  ],
  "title": "Network Traffic"
}

Log Monitoring

ELK Stack

# Install Elasticsearch
# Note: Elasticsearch 8.x enables security (TLS and authentication) by default;
# the examples in this section assume it has been disabled for a test environment.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.8.0-linux-x86_64.tar.gz
tar xzf elasticsearch-8.8.0-linux-x86_64.tar.gz
cd elasticsearch-8.8.0
./bin/elasticsearch

# Logstash configuration
cat > logstash.conf << 'EOF'
input {
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx-access"
    start_position => "beginning"
  }
  file {
    path => "/var/log/nginx/error.log"
    type => "nginx-error"
    start_position => "beginning"
  }
}

filter {
  if [type] == "nginx-access" {
    grok {
      match => {
        "message" => "%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] \"%{WORD:request_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}\" %{INT:status} %{INT:body_bytes_sent} \"%{DATA:http_referer}\" \"%{DATA:http_user_agent}\""
      }
    }
    date {
      match => [ "time_local", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-%{+YYYY.MM.dd}"
  }
}
EOF

# Start Logstash
./bin/logstash -f logstash.conf

# Install Kibana
wget https://artifacts.elastic.co/downloads/kibana/kibana-8.8.0-linux-x86_64.tar.gz
tar xzf kibana-8.8.0-linux-x86_64.tar.gz
cd kibana-8.8.0
./bin/kibana

Log Analysis Queries

# Elasticsearch query examples

# Search for error responses
curl -X GET "localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "status": "500"
    }
  }
}
'

# Aggregation: count requests by status code
curl -X GET "localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "status_codes": {
      "terms": {
        "field": "status.keyword",
        "size": 10
      }
    }
  }
}
'

# Time range query
curl -X GET "localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h",
        "lte": "now"
      }
    }
  }
}
'
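
These building blocks can be combined. The sketch below counts HTTP 500 responses per minute over the last hour, assuming the index pattern and field names produced by the Logstash pipeline above.

# 500 errors per minute over the last hour
curl -X GET "localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "status.keyword": "500" } },
        { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "errors_per_minute": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m"
      }
    }
  }
}
'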

Custom Monitoring Scripts

System Monitoring Script

#!/bin/bash
# system-monitor.sh - collect custom system metrics

METRICS_FILE="/var/lib/prometheus/node_exporter/textfile_collector/system_metrics.prom"
mkdir -p $(dirname "$METRICS_FILE")

# Truncate the metrics file
> "$METRICS_FILE"

# CPU temperature
if command -v sensors >/dev/null 2>&1; then
    temp=$(sensors | grep 'Core 0' | awk '{print $3}' | tr -d '+°C')
    echo "system_cpu_temperature{sensor=\"core0\"} $temp" >> "$METRICS_FILE"
fi

# Number of logged-in users
users=$(who | wc -l)
echo "system_logged_users $users" >> "$METRICS_FILE"

# Number of zombie processes (state starts with Z, e.g. "Z" or "Z+")
zombies=$(ps aux | awk '$8 ~ /^Z/' | wc -l)
echo "system_zombie_processes $zombies" >> "$METRICS_FILE"

# Load averages
load1=$(uptime | awk -F'load average:' '{print $2}' | awk -F, '{print $1}' | tr -d ' ')
load5=$(uptime | awk -F'load average:' '{print $2}' | awk -F, '{print $2}' | tr -d ' ')
load15=$(uptime | awk -F'load average:' '{print $2}' | awk -F, '{print $3}' | tr -d ' ')
echo "system_load_1min $load1" >> "$METRICS_FILE"
echo "system_load_5min $load5" >> "$METRICS_FILE"
echo "system_load_15min $load15" >> "$METRICS_FILE"

# TCP connection state counts
ss -tan | awk 'NR>1 {state[$1]++} END {for(s in state) print "system_tcp_connections{state=\""tolower(s)"\"} "state[s]}' >> "$METRICS_FILE"

# File descriptor usage
fd_used=$(cat /proc/sys/fs/file-nr | awk '{print $1}')
fd_max=$(cat /proc/sys/fs/file-nr | awk '{print $3}')
echo "system_file_descriptors_used $fd_used" >> "$METRICS_FILE"
echo "system_file_descriptors_max $fd_max" >> "$METRICS_FILE"

Application Monitoring Script

#!/bin/bash
# app-monitor.sh - application health check

APP_NAME="myapp"
APP_PORT=8080
METRICS_FILE="/var/lib/prometheus/node_exporter/textfile_collector/${APP_NAME}_metrics.prom"

> "$METRICS_FILE"

# Check whether the process is running
if pgrep -x "$APP_NAME" > /dev/null; then
    echo "app_up{name=\"$APP_NAME\"} 1" >> "$METRICS_FILE"
    
    # Get the PID (first match if more than one)
    PID=$(pgrep -x "$APP_NAME" | head -n 1)
    
    # CPU usage
    cpu=$(ps -p $PID -o %cpu --no-headers)
    echo "app_cpu_percent{name=\"$APP_NAME\"} $cpu" >> "$METRICS_FILE"
    
    # Memory usage
    mem=$(ps -p $PID -o %mem --no-headers)
    echo "app_memory_percent{name=\"$APP_NAME\"} $mem" >> "$METRICS_FILE"
    
    # Thread count
    threads=$(ps -p $PID -o nlwp --no-headers)
    echo "app_threads{name=\"$APP_NAME\"} $threads" >> "$METRICS_FILE"
    
    # Number of open file descriptors
    fds=$(ls -1 /proc/$PID/fd 2>/dev/null | wc -l)
    echo "app_open_fds{name=\"$APP_NAME\"} $fds" >> "$METRICS_FILE"
else
    echo "app_up{name=\"$APP_NAME\"} 0" >> "$METRICS_FILE"
fi

# Check that the port is listening (ss replaces the deprecated netstat)
if ss -tuln | grep -q ":$APP_PORT "; then
    echo "app_port_listening{name=\"$APP_NAME\",port=\"$APP_PORT\"} 1" >> "$METRICS_FILE"
else
    echo "app_port_listening{name=\"$APP_NAME\",port=\"$APP_PORT\"} 0" >> "$METRICS_FILE"
fi

# HTTP health check
if curl -s -o /dev/null -w "%{http_code}" http://localhost:$APP_PORT/health | grep -q "200"; then
    echo "app_health_check{name=\"$APP_NAME\"} 1" >> "$METRICS_FILE"
    
    # Response time
    response_time=$(curl -s -o /dev/null -w "%{time_total}" http://localhost:$APP_PORT/health)
    echo "app_response_time_seconds{name=\"$APP_NAME\"} $response_time" >> "$METRICS_FILE"
else
    echo "app_health_check{name=\"$APP_NAME\"} 0" >> "$METRICS_FILE"
fi

Scheduled Execution

# Add to crontab (runs every minute)
* * * * * /usr/local/bin/system-monitor.sh
* * * * * /usr/local/bin/app-monitor.sh

# Or use a systemd timer
cat > /etc/systemd/system/system-monitor.timer << 'EOF'
[Unit]
Description=System Monitor Timer

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target
EOF

cat > /etc/systemd/system/system-monitor.service << 'EOF'
[Unit]
Description=System Monitor Service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/system-monitor.sh
EOF

systemctl daemon-reload
systemctl enable system-monitor.timer
systemctl start system-monitor.timer
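
To confirm the timer is active and firing:

# Show the timer's next and last trigger times
systemctl list-timers system-monitor.timer

# Check the most recent run of the service it triggers
systemctl status system-monitor.service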

Alert Notifications

Email Alerts

#!/bin/bash
# send-alert-email.sh

SUBJECT="$1"
BODY="$2"
TO="ops@example.com"

echo "$BODY" | mail -s "$SUBJECT" "$TO"

WeChat Work / DingTalk Alerts

#!/bin/bash
# send-alert-webhook.sh

# WeChat Work (企业微信) group-bot webhook; DingTalk bots accept the same msgtype/text payload at a different URL
WEBHOOK_URL="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"
MESSAGE="$1"

curl -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{
    \"msgtype\": \"text\",
    \"text\": {
      \"content\": \"$MESSAGE\"
    }
  }"

Slack Alerts

#!/bin/bash
# send-alert-slack.sh

WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
MESSAGE="$1"

curl -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d "{
    \"text\": \"$MESSAGE\"
  }"

Monitoring Best Practices

Metric Design

The Four Golden Signals:
1. Latency - request response time
2. Traffic - request rate
3. Errors - error rate
4. Saturation - resource utilization

The USE method (for resources):
1. Utilization - how busy the resource is
2. Saturation - how much extra work is queued
3. Errors - error count

The RED method (for request-driven services; see the PromQL sketch below):
1. Rate - request rate
2. Errors - error rate
3. Duration - request duration
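
As a concrete example, the RED metrics for an HTTP service are often precomputed as Prometheus recording rules. The metric names http_requests_total and http_request_duration_seconds below are assumptions; substitute whatever your application actually exports. Like alerts.yml, the file would be listed under rule_files in prometheus.yml.

# red-rules.yml - recording rules for the RED metrics (assumed metric names)
cat > red-rules.yml << 'EOF'
groups:
  - name: red_metrics
    rules:
      # Rate: requests per second, per service
      - record: service:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Errors: share of requests returning 5xx
      - record: service:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      # Duration: 95th percentile latency from a histogram
      - record: service:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
EOF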

Alert Design Principles

1. Actionable
   - Every alert must have a clear remediation procedure
   - Avoid alerts that nobody can act on

2. Accurate
   - Minimize false positives
   - Set reasonable thresholds
   - Use time windows (the "for" clause) to filter out flapping

3. Timely
   - Critical alerts are delivered in real time
   - Non-critical alerts can be aggregated

4. Prioritized
   - Critical: act immediately
   - Warning: needs attention
   - Info: record only

5. Avoid alert fatigue
   - Tune alert frequency sensibly
   - Use inhibition rules
   - Review and refine alerts regularly

Summary: a solid monitoring and alerting system is the foundation of reliable operations; it needs to be reviewed and improved continuously.