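Real-World Cases and Tools

This section walks through real troubleshooting cases step by step and introduces commonly used debugging tools.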

Case Studies

Case 1: Application Suddenly Unreachable

Symptoms:

# Users report that the application cannot be reached
curl https://app.example.com
# curl: (7) Failed to connect to app.example.com port 443: Connection refused

Investigation:

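# 1. Check the Ingress
kubectl get ingress -n production
NAME      HOSTS              ADDRESS        PORTS
myapp     app.example.com    10.0.1.100    80, 443

# 2. Check the Ingress Controller
kubectl get pods -n ingress-nginx
NAME                             READY   STATUS    RESTARTS
ingress-nginx-controller-abc     0/1     Running   5

# The controller is not Ready and has already restarted 5 times

# 3. Read the logs of the previous (crashed) container
kubectl logs -n ingress-nginx ingress-nginx-controller-abc --previous
# Error: Failed to list *v1.Endpoints: endpoints is forbidden

# "forbidden" points to an RBAC permission problem

# 4. Check the ServiceAccount's permissions
kubectl get clusterrolebinding | grep ingress-nginx
# The controller's ClusterRoleBinding is missing: it was deleted by mistake

# 5. Fix: re-apply the manifest the controller was installed from
# (this URL tracks the main branch; prefer the manifest of the release you actually run)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/cloud/deploy.yaml

# 6. Verify
kubectl get pods -n ingress-nginx
# The controller is Running and Ready again

curl https://app.example.com
# The application responds again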

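Summary:

Root cause: the RBAC ClusterRoleBinding was deleted by mistake
Impact: the Ingress Controller could not list or watch Endpoints, so it never became Ready and kept restarting
Prevention: manage infrastructure through GitOps and avoid manual changes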
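
The permission check in step 4 can also be made directly by asking the API server whether the controller's ServiceAccount may perform the call that failed. The following is a minimal sketch, assuming the ServiceAccount is named ingress-nginx in the ingress-nginx namespace (confirm with kubectl get sa -n ingress-nginx). The function name sa_can and the KUBECTL override variable are illustrative and not part of kubectl.

```bash
# Ask the API server whether a ServiceAccount may perform a verb on a resource
# across all namespaces. kubectl auth can-i prints "yes" or "no".
# KUBECTL is a hypothetical override so a different kubectl binary can be used.
sa_can() {
  local namespace="$1" account="$2" verb="$3" resource="$4"
  "${KUBECTL:-kubectl}" auth can-i "$verb" "$resource" \
    --all-namespaces \
    --as="system:serviceaccount:${namespace}:${account}"
}

# Example (needs cluster access and permission to impersonate):
#   sa_can ingress-nginx ingress-nginx list endpoints
```

If this prints "no" right after the controller starts failing, the RBAC objects are the first place to look, before reading any logs.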

Case 2: Pod Memory Grows Steadily Until OOM

Symptoms:

# The Pod restarts frequently
kubectl get pods
NAME           READY   STATUS      RESTARTS
api-server     0/1     OOMKilled   8

Investigation:

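# 1. Check the resource limits
kubectl get pod api-server -o yaml | grep -A 5 resources
resources:
  limits:
    memory: 512Mi
  requests:
    memory: 256Mi

# 2. Review historical memory usage
# Query in Grafana (PromQL)
container_memory_usage_bytes{pod="api-server"}

# Memory grows steadily until it reaches the 512Mi limit and the container is OOM-killed

# 3. Take a heap snapshot (Node.js application)
# Assumes the process writes a snapshot on SIGUSR2, for example because it was
# started with --heapsnapshot-signal=SIGUSR2 or loads the heapdump module.
# The single quotes make $(pidof node) run inside the container, not on your workstation.
kubectl exec api-server -- sh -c 'kill -USR2 $(pidof node)'

# 4. Copy the heap snapshot
# kubectl cp does not expand wildcards, so resolve the exact file name first
SNAPSHOT=$(kubectl exec api-server -- sh -c 'ls -t /tmp/heapdump-*.heapsnapshot | head -n 1')
kubectl cp "api-server:${SNAPSHOT}" ./heap.heapsnapshot

# 5. Analyze with Chrome DevTools
# A large number of Socket objects are never released

# 6. Inspect the code
# The HTTP client connections are not closed properly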

// Problematic code
app.get('/api/data', async (req, res) => {
  const response = await axios.get('http://external-api');
  res.json(response.data);
  // No timeout and no error handling: connections pile up until the pool is exhausted
});

// After the fix
const http = require('http');
const axios = require('axios');

// One shared client with a request timeout and a bounded keep-alive pool
const axiosInstance = axios.create({
  timeout: 5000,
  maxRedirects: 5,
  httpAgent: new http.Agent({
    keepAlive: true,
    maxSockets: 50  // Cap the connection pool size
  })
});

app.get('/api/data', async (req, res) => {
  try {
    const response = await axiosInstance.get('http://external-api');
    res.json(response.data);
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

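Summary:

Root cause: the HTTP connection pool leaked sockets
Impact: memory grew steadily until the container was OOM-killed
Prevention:
- Manage connection pools correctly
- Set up memory alerts
- Profile memory regularly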
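
A memory alert ultimately compares current usage with the container limit. The sketch below shows that arithmetic for quantities in the form kubectl prints them. The function names to_mib and mem_percent are illustrative, only the Ki, Mi and Gi suffixes are handled, and any alert threshold (for example 80%) is an assumption to tune per workload.

```bash
# Convert a Kubernetes memory quantity (Ki, Mi or Gi suffix) to whole MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Print memory usage as an integer percentage of the limit (rounded down).
mem_percent() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}
```

For the Pod in this case, mem_percent 460Mi 512Mi prints 89, which would already be past an 80% warning threshold well before the OOM kill.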

Case 3: Intermittent Network Timeouts

Symptoms:

# Users report that the API occasionally times out
# Monitoring shows abnormally high P99 latency
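
A high P99 alongside a normal median is typical when only some requests hit a slow backend. As a rough illustration of what the metric means, the sketch below computes a nearest-rank percentile from latency samples, one number per line. The function name percentile is illustrative, and it assumes an integer percentile between 1 and 100.

```bash
# Print the p-th percentile (nearest-rank method) of the numbers on stdin.
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

With the samples 0.12, 0.10, 0.11, 0.13 and 2.50 seconds, the 50th percentile is 0.12 while the 99th is 2.50: one slow request in five barely moves the median but dominates the tail.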

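Investigation:

# 1. Check the Service Endpoints
kubectl get endpoints api-service
NAME          ENDPOINTS
api-service   10.244.1.5:8080,10.244.2.8:8080,10.244.3.9:8080

# 3 endpoints, which looks normal

# 2. Test each Pod
# curl reports the total request time itself; `time` is a shell keyword and
# may not exist as a binary inside the container
for pod in $(kubectl get pods -l app=api -o name); do
  echo "Testing $pod"
  kubectl exec "$pod" -- curl -s -o /dev/null -w '%{time_total}s\n' http://localhost:8080/health
done

# One of the Pods responds much more slowly than the others

# 3. Inspect the slow Pod
kubectl exec -it api-abc123 -- top
# CPU usage is abnormally high

# 4. Find the node the Pod runs on
kubectl get pod api-abc123 -o wide
NODE
node-worker-2

# 5. Inspect the node
kubectl describe node node-worker-2
# The node is running a large number of Pods and its resources are exhausted

# 6. Short-term fix: delete the slow Pod so the Deployment recreates it
kubectl delete pod api-abc123

# The replacement Pod was scheduled onto another node and the problem went away

# 7. Long-term fix: configure Pod anti-affinity
# (fragment: selector, labels and containers are omitted)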

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api
              topologyKey: kubernetes.io/hostname

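Summary:

Root cause: Pods were unevenly distributed and one node ran out of resources
Impact: Pods on that node responded slowly
Prevention:
- Configure Pod anti-affinity
- Set resource requests and quotas
- Monitor node resource usage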
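
Uneven Pod distribution is easy to spot once Pods are counted per node. The following is a minimal sketch that reads one node name per line and prints the busiest node first. The function name pods_per_node is illustrative.

```bash
# Read one node name per line on stdin and print "<node> <pod count>",
# busiest node first.
pods_per_node() {
  sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Example (needs cluster access):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

Pods that are not scheduled yet show up under the node name <none>, which is itself a useful signal.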

Case 4: DNS Resolution Failure Makes a Service Unavailable

Symptoms:

# The application logs show a flood of DNS errors
kubectl logs api-server
# Error: getaddrinfo ENOTFOUND database-service

Investigation:

# 1. Test DNS from inside the Pod
kubectl exec -it api-server -- nslookup database-service
# Server: 10.96.0.10
# ** server can't find database-service: NXDOMAIN

# 2. Check whether the Service exists
kubectl get svc database-service
# Error from server (NotFound): services "database-service" not found

# The Service does not exist

# 3. Check the Deployment history
kubectl rollout history deployment api-server

# There was a recent rollout

# 4. Review the configuration
kubectl get deployment api-server -o yaml | grep DATABASE_SERVICE
# DATABASE_SERVICE_HOST=database-service

# The configuration is correct, but the Service really is gone

# 5. Check whether it was deleted by mistake
kubectl get events --sort-by='.lastTimestamp' | grep database-service
# Found a deletion event
# (events are kept only briefly, 1 hour by default; if audit logging is enabled,
# the API server audit log is the more reliable record of who deleted what)

# 6. Restore the Service manifest from Git
git checkout -- database-service.yaml
kubectl apply -f database-service.yaml

# 7. Verify
kubectl exec -it api-server -- nslookup database-service
# Resolves correctly

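Summary:

Root cause: the Service was deleted by mistake
Impact: the application could not reach the database through DNS
Prevention:
- Manage configuration through GitOps
- Use RBAC to guard against accidental deletion
- Back up cluster configuration regularly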
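
A short name such as database-service resolves only through the Pod's DNS search path, so it finds Services in the Pod's own namespace. When diagnosing NXDOMAIN, querying the fully qualified name rules out a namespace mix-up. The sketch below builds that name; the function name svc_fqdn is illustrative, and cluster.local is the default cluster domain, which your cluster may override.

```bash
# Build the fully qualified DNS name of a Service.
# Arguments: service name, namespace (default: default),
#            cluster domain (default: cluster.local).
svc_fqdn() {
  local service="$1" namespace="${2:-default}" domain="${3:-cluster.local}"
  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$domain"
}

# Example (needs cluster access):
#   kubectl exec api-server -- nslookup "$(svc_fqdn database-service production)"
```

If the fully qualified name resolves but the short name does not, the Service exists in a different namespace than the client Pod.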

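Debugging Toolkit

kubectl Plugins

kubectl debug

# No installation needed: `kubectl debug` is a built-in subcommand in current kubectl releases

# Attach an ephemeral debug container to a running Pod
kubectl debug <pod-name> -it --image=nicolaka/netshoot

# Run a debug container on a node
kubectl debug node/<node-name> -it --image=ubuntu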

kubectl-tree

# Install
kubectl krew install tree

# Show the ownership tree of a resource
kubectl tree deployment myapp

# Output:
# NAMESPACE  NAME                            KIND
# default    Deployment/myapp                Deployment
# default    └─ReplicaSet/myapp-abc123       ReplicaSet
# default      ├─Pod/myapp-abc123-xyz        Pod
# default      └─Pod/myapp-abc123-qwe        Pod

kubectl-neat

# Clean up the output (removes managed fields and other clutter)
kubectl krew install neat
kubectl get pod myapp -o yaml | kubectl neat

stern

# Aggregate logs from multiple Pods
brew install stern

# Tail the logs of every Pod whose name matches
stern myapp

# Filtering
stern myapp --since 1h
stern myapp -n production
stern myapp --exclude-container istio-proxy

Network Debugging Tools

netshoot container

# Start a debug container
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# Bundled tools include:
# - curl, wget, httpie
# - nslookup, dig, host
# - ping, traceroute, mtr
# - tcpdump, nmap, netcat
# - iperf3, ab (Apache Bench)

Usage examples:

# DNS diagnostics
nslookup myapp-service
dig myapp-service.default.svc.cluster.local

# HTTP testing (the httpie command is `http`)
curl -v http://myapp-service:8080/health
http http://myapp-service:8080/api/data

# Port scan
nmap -p 1-65535 myapp-service

# Network throughput test (requires an iperf3 server, `iperf3 -s -p 5001`, on the target)
iperf3 -c myapp-service -p 5001 -t 30

# Packet capture
tcpdump -i eth0 -nn port 8080 -w capture.pcap

Performance Analysis Tools

kubectl top

# Show resource usage (requires metrics-server in the cluster)
kubectl top nodes
kubectl top pods --sort-by=memory
kubectl top pods --sort-by=cpu

# Continuous monitoring
watch -n 1 kubectl top pods

kube-capacity

# Install
kubectl krew install resource-capacity

# Show cluster capacity
kubectl resource-capacity

# Restrict to nodes with a given label
kubectl resource-capacity --node-labels=node-role.kubernetes.io/worker

Log Analysis Tools

kubetail

# Install
brew tap johanhaleby/kubetail && brew install kubetail

# Tail the logs of multiple Pods in real time
kubetail myapp

# Color the output per Pod
kubetail myapp --colored-output pod

k9s

# Install
brew install k9s

# Launch (terminal UI)
k9s

# Key bindings:
# :pod     - list Pods
# :svc     - list Services
# :deploy  - list Deployments
# /        - filter
# l        - view logs
# d        - describe the resource
# e        - edit

Cluster Analysis Tools

kubectl-doctor

# Health check
kubectl doctor

# Example output:
# ✓ Cluster is reachable
# ✓ API server is responding
# ✓ All nodes are ready
# ⚠ CoreDNS has 1 replica (recommended: 2)
# ⚠ High pod restart rate detected

Polaris

# Deploy the Polaris dashboard
kubectl apply -f https://github.com/FairwindsOps/polaris/releases/latest/download/dashboard.yaml

# Port-forward
kubectl port-forward -n polaris svc/polaris-dashboard 8080:80

# Open http://localhost:8080
# and review the best-practice findings

kube-score

# Install
brew install kube-score

# Analyze a YAML manifest
kube-score score deployment.yaml

# Output:
# apps/v1/Deployment myapp
# [CRITICAL] Container Resources
#     The container does not have a resource limit set
# [WARNING] Container Image Tag
#     The container is using the latest tag

Troubleshooting Checklists

Quick Checklist

# 1. Overall cluster state
kubectl get nodes
# componentstatuses is deprecated since Kubernetes v1.19; query the API server's readyz endpoint instead
kubectl get --raw='/readyz?verbose'

# 2. Core components
kubectl get pods -n kube-system

# 3. Application state
kubectl get pods --all-namespaces
kubectl get deployments --all-namespaces

# 4. Networking
kubectl get svc --all-namespaces
kubectl get ingress --all-namespaces

# 5. Storage
kubectl get pv
kubectl get pvc --all-namespaces

# 6. Events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50

# 7. Resource usage
kubectl top nodes
kubectl top pods --all-namespaces
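
On a busy cluster the Pod listing in step 3 is too long to scan by eye. The following is a rough filter that keeps only Pods that are not fully Ready or that have restarted at least a given number of times. It assumes the column layout of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name unready_or_restarting and the default threshold of 3 are illustrative.

```bash
# Read `kubectl get pods --all-namespaces --no-headers` output on stdin and
# print namespace/name for Pods that are not fully Ready or whose restart
# count is at least the given threshold (default: 3).
unready_or_restarting() {
  local min_restarts="${1:-3}"
  awk -v min="$min_restarts" '
    {
      split($3, ready, "/")                      # READY column, e.g. 0/1
      if (ready[1] != ready[2] || $5 + 0 >= min + 0) {
        print $1 "/" $2
      }
    }'
}

# Example (needs cluster access):
#   kubectl get pods --all-namespaces --no-headers | unready_or_restarting 5
```

Completed Job Pods show 0/1 in the READY column and are listed too, so filter them out first if they are noise in your cluster.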

Detailed Troubleshooting Script

#!/bin/bash
# k8s-troubleshoot.sh

echo "=== Kubernetes Cluster Health Check ==="

echo -e "\n1. Cluster Info"
kubectl cluster-info

echo -e "\n2. Node Status"
kubectl get nodes -o wide

echo -e "\n3. Unhealthy Pods"
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

echo -e "\n4. Recent Events"
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

echo -e "\n5. Resource Usage"
kubectl top nodes
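# Top 10 Pods by CPU; let kubectl sort, because sort -h does not understand CPU quantities such as 250m
kubectl top pods --all-namespaces --sort-by=cpu | head -n 11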

echo -e "\n6. PVC Status"
kubectl get pvc --all-namespaces | grep -v Bound

echo -e "\n7. Service Endpoints"
kubectl get endpoints --all-namespaces | grep "<none>"

echo "=== Health Check Complete ==="
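
Step 7 of the script relies on grep, which matches "<none>" anywhere in the line and prints the whole row. As a sketch of a stricter variant, the function below checks only the ENDPOINTS column and prints namespace/name. It assumes the column layout of kubectl get endpoints --all-namespaces --no-headers (NAMESPACE, NAME, ENDPOINTS, AGE); the function name services_without_endpoints is illustrative.

```bash
# Read `kubectl get endpoints --all-namespaces --no-headers` output on stdin
# and print namespace/name for every entry that has no endpoints.
services_without_endpoints() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Example (needs cluster access):
#   kubectl get endpoints --all-namespaces --no-headers | services_without_endpoints
```

A Service with no endpoints usually means its selector matches no Ready Pods, so the next step is to compare the Service selector with the Pod labels.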

Command Cheat Sheet

# Pods
kubectl get pods -o wide
kubectl describe pod <name>
kubectl logs <name> --tail=100 -f
kubectl exec -it <name> -- /bin/sh
# Last resort only: force deletion skips graceful shutdown
kubectl delete pod <name> --force --grace-period=0

# Deployments
kubectl get deployments
kubectl rollout status deployment/<name>
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name>
kubectl scale deployment/<name> --replicas=5

# Services
kubectl get svc
kubectl get endpoints <svc-name>
kubectl describe svc <name>

# Debugging
kubectl run debug --rm -it --image=busybox -- sh
kubectl port-forward <pod-name> 8080:8080
kubectl proxy --port=8001

# Resources
kubectl top nodes
kubectl top pods
kubectl describe node <name>

# Events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events --field-selector type=Warning

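Summary

This section covered troubleshooting through case studies and tooling:

✅ Case studies: unreachable application, OOM, network timeouts, DNS failure
✅ kubectl commands and plugins: debug, tree, neat, stern
✅ Network tools: netshoot, tcpdump, iperf3
✅ Performance tools: kubectl top, kube-capacity
✅ Log tools: kubetail, k9s
✅ Analysis tools: kubectl-doctor, Polaris, kube-score
✅ Checklists: the quick checklist and the detailed script

This concludes the troubleshooting chapter! 🎉

Next chapter: Security Best Practices.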