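Monitoring and Alerting System
A well-built monitoring and alerting system is key to keeping a system running stably. This chapter covers best practices for monitoring production environments.
Monitoring Layers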
┌─────────────────────────────────────────┐
│ Business monitoring                     │
│   ├─ API response time                  │
│   ├─ Business success rate              │
│   └─ User behavior analytics            │
├─────────────────────────────────────────┤
│ Application monitoring                  │
│   ├─ Application performance (APM)      │
│   ├─ Log monitoring                     │
│   └─ Error tracking                     │
├─────────────────────────────────────────┤
│ System monitoring                       │
│   ├─ CPU, memory, disk                  │
│   ├─ Network traffic                    │
│   └─ Process status                     │
├─────────────────────────────────────────┤
│ Infrastructure monitoring               │
│   ├─ Server health                      │
│   ├─ Network devices                    │
│   └─ Storage systems                    │
└─────────────────────────────────────────┘
Prometheus Monitoring
Installation and Configuration
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64
# Configuration file prometheus.yml
cat > prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
rule_files:
  - "alerts.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets:
          - 'localhost:9100'
          - 'server1:9100'
          - 'server2:9100'
EOF
# Start Prometheus
./prometheus --config.file=prometheus.yml
Node Exporter
# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xvf node_exporter-1.6.0.linux-amd64.tar.gz
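cd node_exporter-1.6.0.linux-amd64
# Copy the binary to the path used by the systemd unit below
cp node_exporter /usr/local/bin/
# Create the systemd service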
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.textfile.directory=/var/lib/prometheus/node_exporter/textfile_collector \
    --collector.filesystem.mount-points-exclude='^/(sys|proc|dev|host|etc)($$|/)' \
    --collector.netclass.ignored-devices='^(veth.*|docker.*|br-.*|lo)$$'
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Start the service
systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
# Verify
curl http://localhost:9100/metrics
Alert Rules
# alerts.yml
groups:
  - name: system_alerts
    interval: 30s
    rules:
      # CPU alert
      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on instance {{ $labels.instance }}"
          description: "CPU usage: {{ $value }}%"
      # Memory alert
      - alert: HighMemory
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on instance {{ $labels.instance }}"
          description: "Memory usage: {{ $value }}%"
      # Disk space alert
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on instance {{ $labels.instance }}"
          description: "Mount point {{ $labels.mountpoint }} free space: {{ $value }}%"
      # Instance down alert
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Service {{ $labels.job }} is unreachable"
      # Disk I/O alert
      - alert: HighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on instance {{ $labels.instance }}"
          description: "Device {{ $labels.device }} I/O utilization: {{ $value }}"
      # Network traffic alert
      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total[5m]) > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal network traffic on instance {{ $labels.instance }}"
          description: "Interface {{ $labels.device }} receive rate: {{ $value | humanize }}B/s"
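To make the arithmetic behind the HighCPU expression concrete, here is a minimal shell sketch that is not part of the original setup. It assumes a single CPU whose idle counter (the role played by node_cpu_seconds_total{mode="idle"}) is sampled at the start and end of a window. The function names cpu_usage_percent and exceeds_threshold are hypothetical, and the sketch does not model the rule's "for: 5m" hold time.

```shell
#!/bin/bash
# Sketch: reproduce the HighCPU arithmetic by hand for a single CPU.
# cpu_usage_percent and exceeds_threshold are illustrative names only.

# cpu_usage_percent IDLE_START IDLE_END WINDOW_SECONDS
# The idle counter grows by at most 1 second per second, so
# (IDLE_END - IDLE_START) / WINDOW_SECONDS is the idle fraction,
# which is what rate(...[5m]) yields for mode="idle".
# WINDOW_SECONDS must be greater than zero.
cpu_usage_percent() {
    awk -v s="$1" -v e="$2" -v w="$3" \
        'BEGIN { printf "%.1f\n", 100 - ((e - s) / w) * 100 }'
}

# exceeds_threshold VALUE THRESHOLD
# Exit status 0 when VALUE > THRESHOLD, 1 otherwise.
exceeds_threshold() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Example: the idle counter advanced 45 s during a 300 s (5m) window.
# usage=$(cpu_usage_percent 1000 1045 300)
# exceeds_threshold "$usage" 80 && echo "HighCPU condition holds"
```

With several CPUs, Prometheus averages the per-CPU idle fractions (avg by (instance)) before subtracting from 100; the sketch leaves that step out.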
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'
      continue: true
    - match:
        severity: warning
      receiver: 'warning'
receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        headers:
          Subject: '[Prometheus] {{ .GroupLabels.alertname }}'
  - name: 'critical'
    email_configs:
      - to: 'oncall@example.com'
        send_resolved: true
    webhook_configs:
      - url: 'http://webhook.example.com/alerts'
  - name: 'warning'
    email_configs:
      - to: 'ops@example.com'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Grafana Visualization
Installation and Configuration
# Install Grafana
wget https://dl.grafana.com/oss/release/grafana-10.0.0.linux-amd64.tar.gz
tar -zxvf grafana-10.0.0.linux-amd64.tar.gz
cd grafana-10.0.0
# Configuration file: conf/defaults.ini
# Main settings to change (Grafana recommends placing overrides in conf/custom.ini):
# [server]
# http_port = 3000
# domain = localhost
# Start
./bin/grafana-server
# Open http://localhost:3000
# Default username/password: admin/admin
Data Source Configuration
# Add the Prometheus data source (via the UI or the API)
curl -X POST http://admin:admin@localhost:3000/api/datasources \
    -H "Content-Type: application/json" \
    -d '{
        "name":"Prometheus",
        "type":"prometheus",
        "url":"http://localhost:9090",
        "access":"proxy",
        "isDefault":true
    }'
Common Panels
// CPU usage panel
{
  "targets": [
    {
      "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
    }
  ],
  "title": "CPU Usage"
}
// Memory usage
{
  "targets": [
    {
      "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
    }
  ],
  "title": "Memory Usage"
}
// Disk I/O
{
  "targets": [
    {
      "expr": "rate(node_disk_read_bytes_total[5m])"
    },
    {
      "expr": "rate(node_disk_written_bytes_total[5m])"
    }
  ],
  "title": "Disk I/O"
}
// Network traffic
{
  "targets": [
    {
      "expr": "rate(node_network_receive_bytes_total[5m])"
    },
    {
      "expr": "rate(node_network_transmit_bytes_total[5m])"
    }
  ],
  "title": "Network Traffic"
}
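Log Monitoring
ELK Stack
# Install Elasticsearch
# Note: Elasticsearch 8.x enables TLS and authentication by default; the plain-HTTP
# examples below assume security has been disabled for a local test setup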
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.8.0-linux-x86_64.tar.gz
tar xzf elasticsearch-8.8.0-linux-x86_64.tar.gz
cd elasticsearch-8.8.0
./bin/elasticsearch
# Logstash configuration
cat > logstash.conf << 'EOF'
input {
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx-access"
    start_position => "beginning"
  }
  file {
    path => "/var/log/nginx/error.log"
    type => "nginx-error"
    start_position => "beginning"
  }
}
filter {
  if [type] == "nginx-access" {
    grok {
      match => {
        "message" => "%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] \"%{WORD:request_method} %{DATA:request_uri} HTTP/%{NUMBER:http_version}\" %{INT:status} %{INT:body_bytes_sent} \"%{DATA:http_referer}\" \"%{DATA:http_user_agent}\""
      }
    }
    date {
      match => [ "time_local", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-%{+YYYY.MM.dd}"
  }
}
EOF
# Start Logstash
./bin/logstash -f logstash.conf
# Install Kibana
wget https://artifacts.elastic.co/downloads/kibana/kibana-8.8.0-linux-x86_64.tar.gz
tar xzf kibana-8.8.0-linux-x86_64.tar.gz
cd kibana-8.8.0
./bin/kibana
Log Analysis Queries
# Elasticsearch query examples
# Search for error logs (HTTP status 500)
curl -X GET "localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "status": "500"
    }
  }
}
'
# Aggregation query: count by status code
curl -X GET "localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "status_codes": {
      "terms": {
        "field": "status.keyword",
        "size": 10
      }
    }
  }
}
'
# Time range query
curl -X GET "localhost:9200/nginx-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h",
        "lte": "now"
      }
    }
  }
}
'
Custom Monitoring Scripts
System Monitoring Script
#!/bin/bash
# system-monitor.sh - collect system metrics
# Requires node_exporter to run with --collector.textfile.directory set to this directory
METRICS_FILE="/var/lib/prometheus/node_exporter/textfile_collector/system_metrics.prom"
mkdir -p "$(dirname "$METRICS_FILE")"
# Truncate the file
> "$METRICS_FILE"
# CPU temperature (skipped when no reading is available)
if command -v sensors >/dev/null 2>&1; then
    temp=$(sensors | grep 'Core 0' | head -n 1 | awk '{print $3}' | tr -d '+°C')
    if [ -n "$temp" ]; then
        echo "system_cpu_temperature{sensor=\"core0\"} $temp" >> "$METRICS_FILE"
    fi
fi
# Logged-in users
users=$(who | wc -l)
echo "system_logged_users $users" >> "$METRICS_FILE"
# Zombie processes (STAT can be "Z", "Z+", "Zs", ...)
zombies=$(ps aux | awk '$8 ~ /^Z/' | wc -l)
echo "system_zombie_processes $zombies" >> "$METRICS_FILE"
# System load, read once from /proc/loadavg
read -r load1 load5 load15 _ < /proc/loadavg
echo "system_load_1min $load1" >> "$METRICS_FILE"
echo "system_load_5min $load5" >> "$METRICS_FILE"
echo "system_load_15min $load15" >> "$METRICS_FILE"
# TCP connection counts by state
ss -tan | awk 'NR>1 {state[$1]++} END {for(s in state) print "system_tcp_connections{state=\""tolower(s)"\"} "state[s]}' >> "$METRICS_FILE"
# File descriptor usage (fields: allocated, unused, maximum)
read -r fd_used _ fd_max < /proc/sys/fs/file-nr
echo "system_file_descriptors_used $fd_used" >> "$METRICS_FILE"
echo "system_file_descriptors_max $fd_max" >> "$METRICS_FILE"
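The script above truncates and then appends to the .prom file in place, so a scrape that lands mid-run can see a partial file. A common way around this is to write to a temporary file in the same directory and rename it into place. The following is a minimal sketch of that idea, not part of the original script; the function name write_metrics_atomically and the temporary file prefix are assumptions. It relies on the textfile collector only reading files that end in .prom, so the temporary file is ignored.

```shell
#!/bin/bash
# Sketch: publish textfile-collector metrics atomically.
# write_metrics_atomically is an illustrative helper name.

# write_metrics_atomically TARGET
# Reads metric lines from stdin, writes them to a temporary file next to
# TARGET, then renames it over TARGET. A rename within one filesystem is
# atomic, so readers see either the old file or the complete new one.
write_metrics_atomically() {
    local target="$1"
    local dir tmp
    dir=$(dirname "$target")
    mkdir -p "$dir" || return 1
    tmp=$(mktemp "$dir/.metrics.XXXXXX") || return 1
    # mktemp creates the file with mode 600; relax it so a node_exporter
    # process running as another user can still read the result.
    if cat > "$tmp" && chmod 644 "$tmp" && mv "$tmp" "$target"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}

# Example: group the echo statements and pipe them through the helper.
# {
#     echo "system_logged_users $(who | wc -l)"
#     echo "system_load_1min $(cut -d' ' -f1 /proc/loadavg)"
# } | write_metrics_atomically "$METRICS_FILE"
```

Grouping the output this way also removes the need for the initial truncation step, because the target file is replaced as a whole.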
Application Monitoring Script
#!/bin/bash
# app-monitor.sh - application health check
APP_NAME="myapp"
APP_PORT=8080
METRICS_FILE="/var/lib/prometheus/node_exporter/textfile_collector/${APP_NAME}_metrics.prom"
> "$METRICS_FILE"
# Check the process; -o selects the oldest match so PID holds a single value
PID=$(pgrep -xo "$APP_NAME")
if [ -n "$PID" ]; then
    echo "app_up{name=\"$APP_NAME\"} 1" >> "$METRICS_FILE"
    # CPU usage
    cpu=$(ps -p "$PID" -o %cpu --no-headers | tr -d ' ')
    echo "app_cpu_percent{name=\"$APP_NAME\"} $cpu" >> "$METRICS_FILE"
    # Memory usage
    mem=$(ps -p "$PID" -o %mem --no-headers | tr -d ' ')
    echo "app_memory_percent{name=\"$APP_NAME\"} $mem" >> "$METRICS_FILE"
    # Thread count
    threads=$(ps -p "$PID" -o nlwp --no-headers | tr -d ' ')
    echo "app_threads{name=\"$APP_NAME\"} $threads" >> "$METRICS_FILE"
    # Open file descriptors
    fds=$(ls -1 "/proc/$PID/fd" 2>/dev/null | wc -l)
    echo "app_open_fds{name=\"$APP_NAME\"} $fds" >> "$METRICS_FILE"
else
    echo "app_up{name=\"$APP_NAME\"} 0" >> "$METRICS_FILE"
fi
# Check the port (ss replaces the deprecated netstat)
if ss -tuln | grep -q ":$APP_PORT "; then
    echo "app_port_listening{name=\"$APP_NAME\",port=\"$APP_PORT\"} 1" >> "$METRICS_FILE"
else
    echo "app_port_listening{name=\"$APP_NAME\",port=\"$APP_PORT\"} 0" >> "$METRICS_FILE"
fi
# HTTP health check: one request yields both the status code and the response time
result=$(curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}' "http://localhost:$APP_PORT/health")
http_code=${result%% *}
response_time=${result##* }
if [ "$http_code" = "200" ]; then
    echo "app_health_check{name=\"$APP_NAME\"} 1" >> "$METRICS_FILE"
    # Response time
    echo "app_response_time_seconds{name=\"$APP_NAME\"} $response_time" >> "$METRICS_FILE"
else
    echo "app_health_check{name=\"$APP_NAME\"} 0" >> "$METRICS_FILE"
fi
Scheduled Task Configuration
# Add to crontab
* * * * * /usr/local/bin/system-monitor.sh
* * * * * /usr/local/bin/app-monitor.sh
# Or use a systemd timer
cat > /etc/systemd/system/system-monitor.timer << 'EOF'
[Unit]
Description=System Monitor Timer
[Timer]
OnBootSec=1min
OnUnitActiveSec=1min
[Install]
WantedBy=timers.target
EOF
cat > /etc/systemd/system/system-monitor.service << 'EOF'
[Unit]
Description=System Monitor Service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/system-monitor.sh
EOF
systemctl daemon-reload
systemctl enable system-monitor.timer
systemctl start system-monitor.timer
Alert Notifications
Email Alerts
#!/bin/bash
# send-alert-email.sh
SUBJECT="$1"
BODY="$2"
TO="ops@example.com"
echo "$BODY" | mail -s "$SUBJECT" "$TO"
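WeChat / DingTalk Alerts
#!/bin/bash
# send-alert-webhook.sh
# Note: MESSAGE is interpolated as-is, so it must not contain double quotes,
# backslashes, or newlines, otherwise the JSON body becomes invalid
WEBHOOK_URL="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"
MESSAGE="$1"
curl -X POST "$WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{
        \"msgtype\": \"text\",
        \"text\": {
            \"content\": \"$MESSAGE\"
        }
    }"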
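Because the webhook script places the message directly inside a JSON string, alert text containing quotes or line breaks would produce an invalid body. One possible approach is to escape the message first. The sketch below uses bash parameter expansion; json_escape and build_text_payload are hypothetical helper names, and only backslashes, double quotes, newlines, carriage returns, and tabs are handled (other control characters are left untouched).

```shell
#!/bin/bash
# Sketch: escape a message before embedding it in a JSON string.
# json_escape and build_text_payload are illustrative names only.

# json_escape STRING
# Prints STRING with the characters most likely to break a JSON string
# literal escaped. Backslashes are handled first so that the escapes
# added afterwards are not escaped a second time.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    s=${s//$'\r'/\\r}
    s=${s//$'\t'/\\t}
    printf '%s' "$s"
}

# build_text_payload MESSAGE
# Prints a body with the same shape as the one in the webhook script.
build_text_payload() {
    printf '{"msgtype": "text", "text": {"content": "%s"}}' "$(json_escape "$1")"
}

# Example:
# curl -X POST "$WEBHOOK_URL" \
#     -H 'Content-Type: application/json' \
#     -d "$(build_text_payload "$MESSAGE")"
```

Where jq is installed, building the body with it is usually the more robust choice; the pure-bash version only avoids the extra dependency.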
Slack Alerts
#!/bin/bash
# send-alert-slack.sh
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
MESSAGE="$1"
curl -X POST "$WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{
        \"text\": \"$MESSAGE\"
    }"
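Monitoring Best Practices
Metric Design
Golden Signals:
1. Latency - request response time
2. Traffic - request rate
3. Errors - error rate
4. Saturation - how close resources are to capacity
USE method:
1. Utilization - resource usage
2. Saturation - resource saturation (queued or waiting work)
3. Errors - error count
RED method:
1. Rate - request rate
2. Errors - error rate
3. Duration - time taken per request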
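As a worked example of the Rate and Errors signals, the following sketch derives both from an access log. It is illustrative only: red_summary is a hypothetical function name, and it assumes the HTTP status code is the 9th whitespace-separated field, as in the default nginx combined log format. Duration is not covered because that format does not record request time.

```shell
#!/bin/bash
# Sketch: derive Rate and Errors (two of the RED signals) from an access log.
# red_summary is an illustrative name; field 9 is assumed to hold the
# HTTP status code (default nginx combined log format).

# red_summary LOGFILE WINDOW_SECONDS
# Prints "rate=<requests per second> error_ratio=<share of 5xx responses>"
# for the lines in LOGFILE, which are assumed to span WINDOW_SECONDS.
red_summary() {
    awk -v w="$2" '
        { total++; if ($9 >= 500 && $9 <= 599) errors++ }
        END {
            if (total == 0) { print "rate=0.00 error_ratio=0.0000"; exit }
            printf "rate=%.2f error_ratio=%.4f\n", total / w, errors / total
        }' "$1"
}

# Example: summarize a file holding the last 60 seconds of requests.
# red_summary /tmp/last-minute.log 60
```

In a Prometheus setup the same two numbers would normally come from counters exposed by the application, with rate() applied over a window; the log-based version only illustrates the arithmetic.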
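Alert Design Principles
1. Actionability
   - Every alert must have clear handling steps
   - Avoid alerts that nobody can act on
2. Accuracy
   - Reduce false positives
   - Set reasonable thresholds
   - Use time windows to filter out flapping
3. Timeliness
   - Send critical alerts in real time
   - Non-critical alerts can be aggregated
4. Priority
   - Critical: handle immediately
   - Warning: needs attention
   - Info: record only
5. Avoiding alert fatigue
   - Set a reasonable alert frequency
   - Apply alert inhibition
   - Review and tune regularly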
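Alertmanager already limits repeats through repeat_interval, but the standalone notification scripts in this chapter send a message on every call. One way to keep their frequency reasonable is a per-alert cooldown. The sketch below is an assumption-laden illustration: should_send_alert is a hypothetical helper, the state directory is chosen by the caller, and the alert key must be safe to use as a file name.

```shell
#!/bin/bash
# Sketch: per-alert cooldown for the standalone notification scripts.
# should_send_alert is an illustrative name only.

# should_send_alert KEY COOLDOWN_SECONDS STATE_DIR
# Exit status 0 when no notification for KEY was recorded within the last
# COOLDOWN_SECONDS; the current time is then stored as the new send time.
# Exit status 1 when the previous notification is still inside the cooldown.
should_send_alert() {
    local key="$1" cooldown="$2" state_dir="$3"
    local stamp now last
    mkdir -p "$state_dir" || return 1
    stamp="$state_dir/$key.last"
    now=$(date +%s)
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
        if [ $((now - last)) -lt "$cooldown" ]; then
            return 1
        fi
    fi
    printf '%s\n' "$now" > "$stamp"
    return 0
}

# Example: send at most one HighCPU message per host per hour.
# if should_send_alert "HighCPU_server1" 3600 /var/tmp/alert-state; then
#     /usr/local/bin/send-alert-slack.sh "HighCPU on server1"
# fi
```

The helper is not safe against two scripts checking the same key at the same instant; for cron jobs that run once a minute this is usually acceptable.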
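Summary: a well-built monitoring and alerting system is the foundation of operations work and needs continuous tuning and improvement.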