Monitoring and Logging
This chapter walks through deploying a complete monitoring and logging stack on an EKS cluster, covering Prometheus, Grafana, Alertmanager, the ELK Stack, and Jaeger for distributed tracing.
Monitoring Architecture
Monitoring Layers
┌─────────────────────────────────────────────────────┐
│ Business metrics layer                              │
│ ├─ Order volume, conversion rate, user activity     │
│ ├─ API success rate, response time                  │
│ └─ Business exceptions and error rate               │
└────────────────────┬────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────┐
│ Application metrics layer                           │
│ ├─ JVM/Go/Node.js runtime metrics                   │
│ ├─ Application logs and exceptions                  │
│ ├─ Custom business metrics                          │
│ └─ Distributed tracing (Jaeger)                     │
└────────────────────┬────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────┐
│ Container and Kubernetes layer                      │
│ ├─ Pod CPU/memory/network/disk                      │
│ ├─ Container resource usage                         │
│ ├─ Kubernetes events                                │
│ └─ HPA/VPA scaling metrics                          │
└────────────────────┬────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────────┐
│ Infrastructure layer                                │
│ ├─ EC2 instance metrics (nodes)                     │
│ ├─ RDS/Redis/DynamoDB metrics                       │
│ ├─ ALB/NLB metrics                                  │
│ └─ VPC/NAT Gateway metrics                          │
└─────────────────────────────────────────────────────┘
                     ↓
          Prometheus + Grafana
          Alertmanager + PagerDuty
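As a concrete example of the business-metrics layer, counters exposed by the application can be rolled up with Prometheus recording rules. A minimal sketch, assuming the applications expose `orders_total` and `http_requests_total` counters (the same metric names used later in this chapter); the rule names are illustrative:

```yaml
# business-recording-rules.yaml -- illustrative recording rules for the
# business-metrics layer; the metric names are assumptions, adjust to your apps
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: business-recording-rules
  namespace: monitoring
spec:
  groups:
    - name: business-metrics
      interval: 30s
      rules:
        # Orders per second across all services
        - record: business:orders:rate5m
          expr: sum(rate(orders_total[5m]))
        # API success ratio (non-5xx / total)
        - record: business:http_success_ratio:rate5m
          expr: |
            sum(rate(http_requests_total{code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
```

Recording rules like these precompute the expressions that business dashboards and alerts query most often, which keeps dashboard loads cheap.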
Component Architecture
┌─────────────────────────────────────────────────────┐
│ EKS Cluster │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ Prometheus Operator │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Prometheus Server (HA) │ │ │
│ │ │ ├─ ServiceMonitors │ │ │
│ │ │ ├─ PodMonitors │ │ │
│ │ │ └─ PrometheusRules │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Alertmanager (HA) │ │ │
│ │ │ ├─ Slack Notifications │ │ │
│ │ │ ├─ PagerDuty Integration │ │ │
│ │ │ └─ Email Alerts │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Grafana │ │ │
│ │ │ ├─ Dashboards │ │ │
│ │ │ ├─ Data Sources │ │ │
│ │ │ └─ Alert Rules │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ ELK Stack │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Elasticsearch (3 nodes) │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Kibana │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────┐ │ │
│ │ │ Filebeat (DaemonSet) │ │ │
│ │ └──────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────┐ │
│ │ Jaeger (Distributed Tracing) │ │
│ │ ├─ Jaeger Collector │ │
│ │ ├─ Jaeger Query │ │
│ │ └─ Jaeger Agent (DaemonSet) │ │
│ └────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Prometheus Deployment
Install the Prometheus Operator
#!/bin/bash
# install-prometheus-operator.sh
set -euo pipefail

NAMESPACE="monitoring"

echo "================================================"
echo "Installing the Prometheus Operator"
echo "================================================"

# 1. Create the namespace (idempotent, so reruns don't fail)
echo ""
echo "1. Creating the monitoring namespace..."
kubectl create namespace $NAMESPACE --dry-run=client -o yaml | kubectl apply -f -

# 2. Add the Helm repository
echo ""
echo "2. Adding the Prometheus Helm repository..."
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# 3. Write a custom values file
echo ""
echo "3. Writing the configuration file..."
cat > prometheus-values.yaml << 'EOF'
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    # Resources
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi
    # Storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    # Data retention
    retention: 30d
    retentionSize: "45GB"
    # High availability
    replicas: 2
    # Pod anti-affinity: spread replicas across zones
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/name
                    operator: In
                    values:
                      - prometheus
              topologyKey: topology.kubernetes.io/zone
    # Empty selectors: pick up ServiceMonitors/PodMonitors from all namespaces
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    podMonitorNamespaceSelector: {}
    # Additional scrape config: annotation-based pod discovery
    additionalScrapeConfigs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

# Alertmanager
alertmanager:
  alertmanagerSpec:
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 200m
        memory: 512Mi
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    replicas: 3
    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                  - key: app.kubernetes.io/name
                    operator: In
                    values:
                      - alertmanager
              topologyKey: topology.kubernetes.io/zone

# Grafana
grafana:
  enabled: true
  adminPassword: "ChangeMe123!"  # change after first login
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  persistence:
    enabled: true
    storageClassName: gp3
    size: 10Gi
  # Data sources
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-operated:9090
          access: proxy
          isDefault: true
  # Dashboard providers
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  # Pre-installed dashboards (by grafana.com dashboard ID)
  dashboards:
    default:
      kubernetes-cluster:
        gnetId: 7249
        revision: 1
        datasource: Prometheus
      kubernetes-pods:
        gnetId: 6417
        revision: 1
        datasource: Prometheus
      node-exporter:
        gnetId: 1860
        revision: 27
        datasource: Prometheus

# Prometheus Node Exporter
prometheus-node-exporter:
  enabled: true

# Kube State Metrics
kube-state-metrics:
  enabled: true
EOF
# 4. Install the kube-prometheus-stack chart
echo ""
echo "4. Installing the Prometheus stack..."
helm install prometheus-stack prometheus-community/kube-prometheus-stack \
  -n $NAMESPACE \
  -f prometheus-values.yaml

echo ""
echo "5. Waiting for pods to become ready..."
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus \
  -n $NAMESPACE \
  --timeout=300s

echo ""
echo "================================================"
echo "Prometheus Operator installed!"
echo "================================================"
echo ""
echo "Access:"
echo "  Prometheus:   kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-prometheus 9090:9090"
echo "  Grafana:      kubectl port-forward -n monitoring svc/prometheus-stack-grafana 3000:80"
echo "  Alertmanager: kubectl port-forward -n monitoring svc/prometheus-stack-kube-prom-alertmanager 9093:9093"
echo ""
echo "Default Grafana credentials:"
echo "  Username: admin"
echo "  Password: ChangeMe123!"
echo "================================================"
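For the annotation-based `kubernetes-pods` scrape job defined in the values file above to pick up a workload, the pod template must carry the matching annotations. A sketch of the relevant Deployment fragment (the port number is a placeholder for wherever your application serves metrics):

```yaml
# Deployment pod-template fragment: annotations matched by the
# 'kubernetes-pods' additional scrape job
template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"    # matched by the 'keep' relabel rule
      prometheus.io/path: "/metrics"  # rewritten into __metrics_path__
      prometheus.io/port: "8080"      # rewritten into __address__
```

Pods without `prometheus.io/scrape: "true"` are dropped by the `keep` action, so this job only scrapes workloads that opt in.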
Configure Alertmanager
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-stack-kube-prom-alertmanager
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
        # Critical alerts
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
          continue: true
        - match:
            severity: critical
          receiver: 'slack-critical'
        # Warning alerts
        - match:
            severity: warning
          receiver: 'slack-warning'
        # Informational alerts
        - match:
            severity: info
          receiver: 'slack-info'
    receivers:
      # Default receiver
      - name: 'default'
        slack_configs:
          - channel: '#alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      # Critical alerts -> PagerDuty
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
            description: '{{ .GroupLabels.alertname }}'
      # Critical alerts -> Slack
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            color: 'danger'
            title: '🚨 Critical Alert: {{ .GroupLabels.alertname }}'
            text: |
              {{ range .Alerts }}
              *Alert:* {{ .Labels.alertname }}
              *Severity:* {{ .Labels.severity }}
              *Description:* {{ .Annotations.description }}
              *Details:* {{ .Annotations.summary }}
              {{ end }}
      # Warning alerts -> Slack
      - name: 'slack-warning'
        slack_configs:
          - channel: '#alerts-warning'
            color: 'warning'
            title: '⚠️ Warning: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      # Informational alerts -> Slack
      - name: 'slack-info'
        slack_configs:
          - channel: '#alerts-info'
            color: 'good'
            title: 'ℹ️ Info: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
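The routing tree above splits alerts by severity but does not yet suppress redundant notifications: when a critical alert fires, any warning-level alert for the same target is usually noise. An `inhibit_rules` section mutes those; a sketch to append to the `alertmanager.yaml` block (the `equal` label set is an assumption, adjust to your label scheme):

```yaml
# Append to alertmanager.yaml: while a critical alert is firing, mute
# warning-level alerts that share the same alertname/namespace/pod labels
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'namespace', 'pod']
```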
Create Alerting Rules
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: application-alerts
  namespace: monitoring
  labels:
    # kube-prometheus-stack selects PrometheusRules by the Helm release
    # label by default; this must match the release name used above
    release: prometheus-stack
spec:
  groups:
    # Application health
    - name: application-health
      interval: 30s
      rules:
        # Pod restarting frequently
        - alert: PodRestartingFrequently
          expr: |
            rate(kube_pod_container_status_restarts_total[15m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
            description: "Restart rate over the last 15 minutes: {{ $value }} per second"
        # Pod not ready
        - alert: PodNotReady
          expr: |
            sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
            description: "Pod has been in a non-Running phase for 10 minutes"
        # Container OOMKilled
        - alert: ContainerOOMKilled
          expr: |
            sum by (namespace, pod, container) (kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Container {{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} was OOMKilled"
            description: "Increase the memory limit or optimize the application's memory usage"
    # Resource usage
    - name: resource-usage
      interval: 30s
      rules:
        # High CPU usage
        - alert: HighCPUUsage
          expr: |
            sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m])) /
            sum by (namespace, pod) (container_spec_cpu_quota / container_spec_cpu_period) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} CPU usage above 90% of its limit"
            description: "Current CPU usage: {{ $value | humanizePercentage }}"
        # High memory usage
        - alert: HighMemoryUsage
          expr: |
            sum by (namespace, pod) (container_memory_working_set_bytes) /
            sum by (namespace, pod) (container_spec_memory_limit_bytes) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} memory usage above 90% of its limit"
            description: "Current memory usage: {{ $value | humanizePercentage }}"
        # High disk usage
        - alert: HighDiskUsage
          expr: |
            (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) /
            node_filesystem_size_bytes{mountpoint="/"} > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} disk usage above 85%"
            description: "Current disk usage: {{ $value | humanizePercentage }}"
    # API performance
    - name: api-performance
      interval: 30s
      rules:
        # Slow API responses
        - alert: SlowAPIResponse
          expr: |
            histogram_quantile(0.95,
              sum by (le, namespace, service) (rate(http_request_duration_seconds_bucket[5m]))
            ) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Service {{ $labels.namespace }}/{{ $labels.service }} API responses are slow"
            description: "P95 latency: {{ $value }}s"
        # High API error rate
        - alert: HighAPIErrorRate
          expr: |
            sum by (namespace, service) (rate(http_requests_total{code=~"5.."}[5m])) /
            sum by (namespace, service) (rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Service {{ $labels.namespace }}/{{ $labels.service }} error rate is high"
            description: "Current error rate: {{ $value | humanizePercentage }}"
    # Databases (the aws_rds_* metrics assume a CloudWatch exporter is running)
    - name: database-alerts
      interval: 30s
      rules:
        # RDS high CPU
        - alert: RDSHighCPU
          expr: |
            aws_rds_cpuutilization_average > 80
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "RDS {{ $labels.dbinstance_identifier }} CPU usage is high"
            description: "Current CPU: {{ $value }}%"
        # RDS connection count high
        - alert: RDSHighConnections
          expr: |
            aws_rds_database_connections_average > 800
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "RDS {{ $labels.dbinstance_identifier }} has too many connections"
            description: "Current connections: {{ $value }}"
        # Redis memory high
        - alert: RedisHighMemory
          expr: |
            redis_memory_used_bytes / redis_memory_max_bytes > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Redis {{ $labels.instance }} memory usage is high"
            description: "Current usage: {{ $value | humanizePercentage }}"
ServiceMonitor Example
# user-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: user-service
  namespace: production
  labels:
    app: user-service
spec:
  selector:
    matchLabels:
      app: user-service
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
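The ServiceMonitor above selects a Service by label and scrapes the port *named* `metrics`, so the target Service must define a port with that name. A minimal sketch of the matching Service (the port numbers are assumptions):

```yaml
# user-service.yaml -- Service matched by the ServiceMonitor above;
# the port name 'metrics' is what 'port: metrics' refers to
apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: production
  labels:
    app: user-service        # matched by spec.selector.matchLabels
spec:
  selector:
    app: user-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics          # must match the ServiceMonitor endpoint
      port: 9090
      targetPort: 9090
```

If the port is unnamed or the name differs, Prometheus silently discovers no targets, which is the most common reason a ServiceMonitor "does nothing".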
Grafana Dashboards
Custom Business Dashboard
{
  "dashboard": {
    "title": "Business Metrics Dashboard",
    "tags": ["business", "production"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Orders (real-time)",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(orders_total[5m]))",
            "legendFormat": "Orders per second"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "id": 2,
        "title": "API Success Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{code!~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 4}
      },
      {
        "id": 3,
        "title": "P95 Response Time",
        "type": "stat",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ],
        "gridPos": {"x": 18, "y": 0, "w": 6, "h": 4}
      }
    ]
  }
}
ELK Stack Deployment
Elasticsearch Deployment
# elasticsearch-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: monitoring
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      initContainers:
        # Elasticsearch requires vm.max_map_count >= 262144 on the host
        - name: increase-vm-max-map
          image: busybox
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
          env:
            - name: cluster.name
              value: "production-logs"
            - name: node.name
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: discovery.seed_hosts
              value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
            - name: cluster.initial_master_nodes
              value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
            - name: ES_JAVA_OPTS
              value: "-Xms2g -Xmx2g"
            # Disabled to keep the example simple; enable security in production
            - name: xpack.security.enabled
              value: "false"
          ports:
            - containerPort: 9200
              name: http
            - containerPort: 9300
              name: transport
          resources:
            requests:
              cpu: 1000m
              memory: 3Gi
            limits:
              cpu: 2000m
              memory: 4Gi
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: monitoring
spec:
  selector:
    app: elasticsearch
  ports:
    - port: 9200
      name: http
    - port: 9300
      name: transport
  clusterIP: None  # headless Service: gives each pod a stable DNS name
Kibana Deployment
# kibana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:8.11.0
          env:
            - name: ELASTICSEARCH_HOSTS
              value: "http://elasticsearch:9200"
            - name: SERVER_NAME
              value: "kibana"
            - name: SERVER_HOST
              value: "0.0.0.0"
          ports:
            - containerPort: 5601
              name: http
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 1000m
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /api/status
              port: 5601
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: monitoring
spec:
  selector:
    app: kibana
  ports:
    - port: 5601
      targetPort: 5601
  type: ClusterIP
Filebeat Deployment
# filebeat-daemonset.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: monitoring
data:
  filebeat.yml: |
    filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
          - add_kubernetes_metadata:
              host: ${NODE_NAME}
              matchers:
                - logs_path:
                    logs_path: "/var/log/containers/"
    output.elasticsearch:
      hosts: ['elasticsearch:9200']
      index: "filebeat-%{+yyyy.MM.dd}"
    setup.template.name: "filebeat"
    setup.template.pattern: "filebeat-*"
    setup.ilm.enabled: false
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: filebeat
  template:
    metadata:
      labels:
        app: filebeat
    spec:
      serviceAccountName: filebeat  # needs RBAC to read pod metadata
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:8.11.0
          args: [
            "-c", "/etc/filebeat.yml",
            "-e",
          ]
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              cpu: 100m
              memory: 200Mi
            limits:
              cpu: 200m
              memory: 400Mi
          volumeMounts:
            - name: config
              mountPath: /etc/filebeat.yml
              subPath: filebeat.yml
            - name: data
              mountPath: /usr/share/filebeat/data
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: filebeat-config
        # Legacy Docker log path; on containerd nodes the symlinks under
        # /var/log/containers resolve into /var/log/pods (covered by varlog)
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: varlog
          hostPath:
            path: /var/log
        # Registry data survives pod restarts so logs are not re-shipped
        - name: data
          hostPath:
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate
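By default Filebeat emits one event per log line, which splits Java or Go stack traces across many documents. The container input's multiline options can merge continuation lines into the preceding event; a sketch for the `filebeat.yml` above, assuming application log lines start with an ISO date (`2024-01-01 ...`) -- adjust the pattern to your log format:

```yaml
# filebeat.yml fragment: merge stack-trace continuation lines into one event
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    multiline:
      type: pattern
      pattern: '^\d{4}-\d{2}-\d{2}'  # a new event starts with a date
      negate: true                   # lines NOT matching the pattern...
      match: after                   # ...are appended to the previous event
```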
Jaeger Distributed Tracing
Install the Jaeger Operator
#!/bin/bash
# install-jaeger.sh
set -euo pipefail

echo "Installing the Jaeger Operator..."

# Install cert-manager (prerequisite for the operator's admission webhooks)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Wait for cert-manager to come up before installing the operator,
# otherwise the operator's webhook certificates cannot be issued
kubectl wait --for=condition=Available deployment --all \
  -n cert-manager --timeout=300s

# Install the Jaeger Operator
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.50.0/jaeger-operator.yaml -n observability

echo "✓ Jaeger Operator installed"
Jaeger Instance Configuration
# jaeger-instance.yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-production
  namespace: monitoring
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
        index-prefix: jaeger
    esIndexCleaner:
      enabled: true
      numberOfDays: 7
      schedule: "55 23 * * *"
  collector:
    replicas: 3
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        cpu: 1000m
        memory: 2Gi
  query:
    replicas: 2
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi
  agent:
    strategy: DaemonSet
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 200m
        memory: 256Mi
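Two optional knobs worth knowing, sketched below with illustrative values: a probabilistic sampling strategy set on the Jaeger CR (the 10% rate is an assumption to tune per traffic volume), and per-workload agent sidecar injection via annotation as an alternative to the DaemonSet agent configured above (the Deployment name is hypothetical):

```yaml
# Fragment for the Jaeger CR above: sample ~10% of traces cluster-wide
spec:
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1
---
# Alternative to the DaemonSet agent: have the Jaeger Operator inject an
# agent sidecar into a specific workload (workload name is illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
  annotations:
    "sidecar.jaegertracing.io/inject": "true"
```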
Best Practices Summary
1. Prometheus
✓ Manage with the Operator
✓ Configure high availability (2+ replicas)
✓ Set a sensible retention period
✓ Use persistent storage
✓ Auto-discover targets with ServiceMonitors
✓ Layer alerting rules by severity
2. Grafana
✓ Use persistent storage
✓ Configure multiple data sources
✓ Build layered dashboards
✓ Enable alert notifications
✓ Back up configuration regularly
✓ Use template variables for reusability
3. Alertmanager
✓ Configure multiple receivers
✓ Group alerts by severity
✓ Avoid alert storms
✓ Configure inhibition rules
✓ Integrate with PagerDuty
✓ Test alerting regularly
4. Log management
✓ Centralize log collection
✓ Use structured log formats
✓ Configure index lifecycle management
✓ Set log retention policies
✓ Filter by log level
✓ Clean up old logs regularly
5. Distributed tracing
✓ Enable tracing in every service
✓ Use sampling to reduce overhead
✓ Configure context propagation
✓ Set a sensible retention period
✓ Monitor the tracing system itself
✓ Correlate traces with logs and metrics
Next: continue with the Auto Scaling and Disaster Recovery chapter.