Scheduling Policies - Affinity and Taints

Pod Scheduling Flow

How the Kubernetes scheduler picks a node:

1. Filtering: rule out nodes that do not meet the Pod's requirements
2. Scoring: score the remaining nodes
3. Selection: bind the Pod to the highest-scoring node
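The filtering and scoring stages are implemented as scheduler plugins, and their behavior can be tuned through a scheduler configuration file. As a hedged sketch (the plugin names below are real defaults, but this particular profile is an illustration, not a recommended setup):

```yaml
# Sketch of a KubeSchedulerConfiguration tuning the filter/score pipeline.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation  # turn off one default scoring plugin
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated   # score nodes to pack Pods rather than spread them
```

This file is passed to kube-scheduler via its --config flag; most clusters never need to change it.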

NodeSelector - Node Selector

The simplest way to constrain Pods to nodes.

Labeling Nodes

# Add labels
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=hdd
kubectl label nodes node-3 gpu=nvidia

# View labels
kubectl get nodes --show-labels

# Remove a label
kubectl label nodes node-1 disktype-

Using NodeSelector

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd  # only schedule onto nodes labeled disktype=ssd
  containers:
  - name: nginx
    image: nginx

NodeAffinity - Node Affinity

A more expressive way to select nodes than NodeSelector.

Hard Requirement

The node must satisfy the conditions, similar to NodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx

Soft Preference

Preferred, but not mandatory:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-preference
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-west-1a
  containers:
  - name: nginx
    image: nginx

Operators

  • In: the label's value is in the list
  • NotIn: the label's value is not in the list
  • Exists: the label exists
  • DoesNotExist: the label does not exist
  • Gt: the label's value is greater than (numeric)
  • Lt: the label's value is less than (numeric)

matchExpressions:
- key: node.kubernetes.io/instance-type
  operator: In
  values:
  - m5.large
  - m5.xlarge
- key: topology.kubernetes.io/zone
  operator: NotIn
  values:
  - us-west-1c
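Gt and Lt compare the label's value as an integer and take exactly one value. A hedged sketch (cpu-count is an assumed custom label set by the cluster admin, not one Kubernetes applies automatically):

```yaml
# Select nodes whose (hypothetical) cpu-count label is greater than 8
# and which carry a gpu label with any value.
matchExpressions:
- key: cpu-count        # assumed custom label, not a built-in node label
  operator: Gt
  values:
  - "8"                 # Gt/Lt take a single integer, written as a string
- key: gpu
  operator: Exists      # Exists takes no values field
```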

PodAffinity - Pod Affinity

Schedule a Pod based on where other Pods are already running.

Pod Affinity

Co-locate Pods:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redis
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx

Meaning: schedule onto the same node as a Pod labeled app=redis.

Pod Anti-Affinity

Spread Pods apart:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: webapp
            topologyKey: kubernetes.io/hostname
      containers:
      - name: webapp
        image: webapp:v1

Effect: at most one webapp Pod runs per node, which improves availability.

topologyKey

Defines the "topology domain". Common values:

  • kubernetes.io/hostname: node level
  • topology.kubernetes.io/zone: availability-zone level
  • topology.kubernetes.io/region: region level

# Spread across nodes
topologyKey: kubernetes.io/hostname

# Spread across availability zones
topologyKey: topology.kubernetes.io/zone

# Spread across regions
topologyKey: topology.kubernetes.io/region
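To show how changing topologyKey changes the spreading granularity, here is a hedged sketch of the earlier webapp Deployment rewritten to prefer different availability zones (the image and labels are illustrative):

```yaml
# Prefer, but do not require, that webapp replicas land in different zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-zonal
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:      # preferred rules wrap the term in podAffinityTerm
              labelSelector:
                matchLabels:
                  app: webapp
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: webapp
        image: webapp:v1
```

A preferred rule is usually the better choice at zone level: a required rule would leave replicas Pending once every zone already holds one.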

Taints and Tolerations

A taint lets a node repel Pods unless a Pod has a matching toleration.

Adding a Taint

# Add a taint
kubectl taint nodes node-1 key=value:NoSchedule

# View taints
kubectl describe node node-1 | grep Taints

# Remove a taint
kubectl taint nodes node-1 key:NoSchedule-

Taint Effects

NoSchedule: new Pods are not scheduled (existing Pods are unaffected)

kubectl taint nodes node-1 gpu=true:NoSchedule

PreferNoSchedule: avoid scheduling here if possible (soft constraint)

kubectl taint nodes node-1 gpu=true:PreferNoSchedule

NoExecute: also evict Pods already running on the node

kubectl taint nodes node-1 maintenance=true:NoExecute

Adding a Toleration

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu

Tolerating All Taints

tolerations:
- operator: "Exists"  # tolerates every taint
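Tolerating everything is mostly seen in DaemonSets (e.g. networking agents) that must run on every node. More often you tolerate one well-known taint; a sketch using the standard control-plane taint that kubeadm clusters apply:

```yaml
# Allow this Pod onto control-plane nodes, which carry the taint
# node-role.kubernetes.io/control-plane:NoSchedule.
tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"    # match the taint regardless of its value
  effect: "NoSchedule"
```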

Tolerating for a Limited Time

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # evict only after 300 seconds
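Kubernetes injects a pair of tolerations like this into every Pod by default (via the DefaultTolerationSeconds admission controller), which is why Pods on a failed node are typically evicted after about five minutes. A sketch of what gets injected:

```yaml
# Default tolerations added to Pods; set your own values for these keys
# to control how quickly Pods leave a failing node.
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```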

Practical Scenarios

Scenario 1: Dedicated GPU Nodes

# Label and taint the GPU node
kubectl label nodes gpu-node-1 gpu=nvidia
kubectl taint nodes gpu-node-1 gpu=nvidia:NoSchedule

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      nodeSelector:
        gpu: nvidia
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "nvidia"
        effect: "NoSchedule"
      containers:
      - name: trainer
        image: tensorflow/tensorflow:latest-gpu
        resources:
          limits:
            nvidia.com/gpu: 1

Scenario 2: Highly Available Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      affinity:
        # Pod anti-affinity: spread across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: critical-app
            topologyKey: kubernetes.io/hostname
        # Node affinity: prefer SSD nodes
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: app
        image: critical-app:v1

Scenario 3: Co-locate Database and Application

# Redis
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
        cache: redis
    spec:
      containers:
      - name: redis
        image: redis:7.0
---
# Application (same node as Redis)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  cache: redis
              topologyKey: kubernetes.io/hostname
      containers:
      - name: webapp
        image: webapp:v1

Scenario 4: Node Maintenance

# Mark the node for maintenance
kubectl taint nodes node-1 maintenance=true:NoSchedule

# Evict existing Pods
kubectl taint nodes node-1 maintenance=true:NoExecute

# Drain the node (the more graceful option)
kubectl drain node-1 --ignore-daemonsets

# Restore the node after maintenance
kubectl uncordon node-1
kubectl taint nodes node-1 maintenance:NoSchedule-
kubectl taint nodes node-1 maintenance:NoExecute-

PriorityClass - Priority

Defines a Pod's priority; higher-priority Pods can preempt lower-priority ones.

Creating a PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority applications"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low-priority applications"

Using PriorityClass

apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:v1

Effect

  • When resources are scarce, lower-priority Pods may be evicted (preempted)
  • Higher-priority Pods are scheduled first
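If you want priority without preemption, so that high-priority Pods jump the scheduling queue but never evict running Pods, PriorityClass supports a preemptionPolicy field. A sketch:

```yaml
# A priority class whose Pods are scheduled ahead of lower-priority Pods
# but never preempt Pods that are already running.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never   # the default is PreemptLowerPriority
globalDefault: false
description: "High priority without preemption"
```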

Custom Scheduler

To use your own scheduling logic, point the Pod at a custom scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler-pod
spec:
  schedulerName: my-scheduler  # handled by the custom scheduler, not the default
  containers:
  - name: nginx
    image: nginx

Common Commands

# Node labels
kubectl label nodes <node> key=value
kubectl get nodes --show-labels

# Taints
kubectl taint nodes <node> key=value:effect
kubectl describe node <node> | grep Taints

# Inspect Pod scheduling
kubectl get pods -o wide
kubectl describe pod <pod-name>

# Drain a node
kubectl drain <node> --ignore-daemonsets
kubectl uncordon <node>

# PriorityClass
kubectl get priorityclass
kubectl describe priorityclass <name>

Best Practices

1. Prefer NodeSelector for Simple Cases

Use NodeSelector when a plain label match is enough:

nodeSelector:
  disktype: ssd

2. Use Anti-Affinity for High Availability

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: webapp
    topologyKey: kubernetes.io/hostname

3. Use Taints for Dedicated Nodes

kubectl taint nodes gpu-node gpu=true:NoSchedule

4. Set a Priority for Critical Applications

priorityClassName: high-priority

5. Use drain for Node Maintenance

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

6. Test Your Scheduling Policies

# Generate a test Pod manifest
kubectl run test --image=nginx --dry-run=client -o yaml > test.yaml

# Apply it after adding your scheduling policy
kubectl apply -f test.yaml

# Check where it was scheduled
kubectl get pod test -o wide
kubectl describe pod test

Summary

Kubernetes offers a rich set of scheduling policies:

Node Selection

  • NodeSelector: simple label matching
  • NodeAffinity: flexible node selection

Pod Placement

  • PodAffinity: co-locate Pods
  • PodAntiAffinity: spread Pods apart

Node Management

  • Taints: repel Pods from nodes
  • Tolerations: allow Pods onto tainted nodes

Priority

  • PriorityClass: preemptive scheduling

Use Cases

  • Dedicated GPU nodes
  • Highly available deployments
  • Data locality
  • Node maintenance

In the next chapter we will look at HPA autoscaling.