Scheduling Policies - Affinity and Taints

Pod Scheduling Flow

How the Kubernetes scheduler picks a node:

1. Filtering: rule out nodes that do not meet the Pod's requirements
2. Scoring: score the remaining nodes
3. Selection: bind the Pod to the highest-scoring node
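The filtering and scoring stages are implemented as scheduler plugins, and their behavior can be tuned through a scheduler configuration file. As a hedged sketch (the plugin names below are real defaults, but this particular profile is an illustration, not a recommended setup):

```yaml
# Sketch of a KubeSchedulerConfiguration tuning the filter/score pipeline.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation  # turn off one default scoring plugin
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated   # score nodes to pack Pods rather than spread them
```

This file is passed to kube-scheduler via its --config flag; most clusters never need to change it.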

NodeSelector - Node Selector

The simplest way to constrain Pods to nodes.

Labeling Nodes

# Add labels
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=hdd
kubectl label nodes node-3 gpu=nvidia

# View labels
kubectl get nodes --show-labels

# Remove a label
kubectl label nodes node-1 disktype-

Using NodeSelector

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd  # only schedule onto nodes labeled disktype=ssd
  containers:
  - name: nginx
    image: nginx

NodeAffinity - Node Affinity

A more expressive way to select nodes than NodeSelector.

Hard Requirement

The node must satisfy the conditions, similar to NodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx

Soft Preference

Preferred, but not mandatory:

apiVersion: v1
kind: Pod
metadata:
  name: with-node-preference
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-west-1a
  containers:
  - name: nginx
    image: nginx

Operators

  • In: the label's value is in the list
  • NotIn: the label's value is not in the list
  • Exists: the label exists
  • DoesNotExist: the label does not exist
  • Gt: the label's value is greater than (numeric)
  • Lt: the label's value is less than (numeric)

matchExpressions:
- key: node.kubernetes.io/instance-type
  operator: In
  values:
  - m5.large
  - m5.xlarge
- key: topology.kubernetes.io/zone
  operator: NotIn
  values:
  - us-west-1c
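Gt and Lt compare the label's value as an integer and take exactly one value. A hedged sketch (cpu-count is an assumed custom label set by the cluster admin, not one Kubernetes applies automatically):

```yaml
# Select nodes whose (hypothetical) cpu-count label is greater than 8
# and which carry a gpu label with any value.
matchExpressions:
- key: cpu-count        # assumed custom label, not a built-in node label
  operator: Gt
  values:
  - "8"                 # Gt/Lt take a single integer, written as a string
- key: gpu
  operator: Exists      # Exists takes no values field
```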

PodAffinity - Pod Affinity

Schedule a Pod based on where other Pods are already running.

Pod Affinity

Co-locate Pods:

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redis
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx

Meaning: schedule onto the same node as a Pod labeled app=redis.

Pod Anti-Affinity

Spread Pods apart:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: webapp
            topologyKey: kubernetes.io/hostname
      containers:
      - name: webapp
        image: webapp:v1

Effect: at most one webapp Pod runs per node, which improves availability.

topologyKey

Defines the "topology domain". Common values:

  • kubernetes.io/hostname: node level
  • topology.kubernetes.io/zone: availability-zone level
  • topology.kubernetes.io/region: region level

# Spread across nodes
topologyKey: kubernetes.io/hostname

# Spread across availability zones
topologyKey: topology.kubernetes.io/zone

# Spread across regions
topologyKey: topology.kubernetes.io/region
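To show how changing topologyKey changes the spreading granularity, here is a hedged sketch of the earlier webapp Deployment rewritten to prefer different availability zones (the image and labels are illustrative):

```yaml
# Prefer, but do not require, that webapp replicas land in different zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-zonal
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:      # preferred rules wrap the term in podAffinityTerm
              labelSelector:
                matchLabels:
                  app: webapp
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: webapp
        image: webapp:v1
```

A preferred rule is usually the better choice at zone level: a required rule would leave replicas Pending once every zone already holds one.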

Taints and Tolerations

A taint lets a node repel Pods unless a Pod has a matching toleration.

Adding a Taint

# Add a taint
kubectl taint nodes node-1 key=value:NoSchedule

# View taints
kubectl describe node node-1 | grep Taints

# Remove a taint
kubectl taint nodes node-1 key:NoSchedule-

Taint Effects

NoSchedule: new Pods are not scheduled (existing Pods are unaffected)

kubectl taint nodes node-1 gpu=true:NoSchedule

PreferNoSchedule: avoid scheduling here if possible (soft constraint)

kubectl taint nodes node-1 gpu=true:PreferNoSchedule

NoExecute: also evict Pods already running on the node

kubectl taint nodes node-1 maintenance=true:NoExecute

Adding a Toleration

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu

Tolerating All Taints

tolerations:
- operator: "Exists"  # tolerates every taint
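Tolerating everything is mostly seen in DaemonSets (e.g. networking agents) that must run on every node. More often you tolerate one well-known taint; a sketch using the standard control-plane taint that kubeadm clusters apply:

```yaml
# Allow this Pod onto control-plane nodes, which carry the taint
# node-role.kubernetes.io/control-plane:NoSchedule.
tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"    # match the taint regardless of its value
  effect: "NoSchedule"
```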

Tolerating for a Limited Time

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # evict only after 300 seconds
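Kubernetes injects a pair of tolerations like this into every Pod by default (via the DefaultTolerationSeconds admission controller), which is why Pods on a failed node are typically evicted after about five minutes. A sketch of what gets injected:

```yaml
# Default tolerations added to Pods; set your own values for these keys
# to control how quickly Pods leave a failing node.
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```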

Practical Scenarios

Scenario 1: Dedicated GPU Nodes

# Label and taint the GPU node
kubectl label nodes gpu-node-1 gpu=nvidia
kubectl taint nodes gpu-node-1 gpu=nvidia:NoSchedule

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      nodeSelector:
        gpu: nvidia
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "nvidia"
        effect: "NoSchedule"
      containers:
      - name: trainer
        image: tensorflow/tensorflow:latest-gpu
        resources:
          limits:
            nvidia.com/gpu: 1

Scenario 2: Highly Available Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      affinity:
        # Pod anti-affinity: spread across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: critical-app
            topologyKey: kubernetes.io/hostname
        # Node affinity: prefer SSD nodes
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: app
        image: critical-app:v1

Scenario 3: Co-locate Database and Application

# Redis
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
        cache: redis
    spec:
      containers:
      - name: redis
        image: redis:7.0
---
# Application (same node as Redis)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  cache: redis
              topologyKey: kubernetes.io/hostname
      containers:
      - name: webapp
        image: webapp:v1

Scenario 4: Node Maintenance

# Mark the node for maintenance
kubectl taint nodes node-1 maintenance=true:NoSchedule

# Evict existing Pods
kubectl taint nodes node-1 maintenance=true:NoExecute

# Drain the node (the more graceful option)
kubectl drain node-1 --ignore-daemonsets

# Restore the node after maintenance
kubectl uncordon node-1
kubectl taint nodes node-1 maintenance:NoSchedule-
kubectl taint nodes node-1 maintenance:NoExecute-

PriorityClass - Priority

Defines a Pod's priority; higher-priority Pods can preempt lower-priority ones.

Creating a PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority applications"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low-priority applications"

Using PriorityClass

apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:v1

Effect

  • When resources are scarce, lower-priority Pods may be evicted (preempted)
  • Higher-priority Pods are scheduled first
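If you want priority without preemption, so that high-priority Pods jump the scheduling queue but never evict running Pods, PriorityClass supports a preemptionPolicy field. A sketch:

```yaml
# A priority class whose Pods are scheduled ahead of lower-priority Pods
# but never preempt Pods that are already running.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never   # the default is PreemptLowerPriority
globalDefault: false
description: "High priority without preemption"
```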

Custom Scheduler

To use your own scheduling logic, point the Pod at a custom scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler-pod
spec:
  schedulerName: my-scheduler  # handled by the custom scheduler, not the default
  containers:
  - name: nginx
    image: nginx

Common Commands

# Node labels
kubectl label nodes <node> key=value
kubectl get nodes --show-labels

# Taints
kubectl taint nodes <node> key=value:effect
kubectl describe node <node> | grep Taints

# Inspect Pod scheduling
kubectl get pods -o wide
kubectl describe pod <pod-name>

# Drain a node
kubectl drain <node> --ignore-daemonsets
kubectl uncordon <node>

# PriorityClass
kubectl get priorityclass
kubectl describe priorityclass <name>

Best Practices

1. Prefer NodeSelector for Simple Cases

Use NodeSelector when a plain label match is enough:

nodeSelector:
  disktype: ssd

2. Use Anti-Affinity for High Availability

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: webapp
    topologyKey: kubernetes.io/hostname

3. Use Taints for Dedicated Nodes

kubectl taint nodes gpu-node gpu=true:NoSchedule

4. Set a Priority for Critical Applications

priorityClassName: high-priority

5. Use drain for Node Maintenance

kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data

6. Test Your Scheduling Policies

# Generate a test Pod manifest
kubectl run test --image=nginx --dry-run=client -o yaml > test.yaml

# Apply it after adding your scheduling policy
kubectl apply -f test.yaml

# Check where it was scheduled
kubectl get pod test -o wide
kubectl describe pod test

Summary

Kubernetes offers a rich set of scheduling policies:

Node Selection

  • NodeSelector: simple label matching
  • NodeAffinity: flexible node selection

Pod Placement

  • PodAffinity: co-locate Pods
  • PodAntiAffinity: spread Pods apart

Node Management

  • Taints: repel Pods from nodes
  • Tolerations: allow Pods onto tainted nodes

Priority

  • PriorityClass: preemptive scheduling

Use Cases

  • Dedicated GPU nodes
  • Highly available deployments
  • Data locality
  • Node maintenance

In the next chapter we will look at HPA autoscaling.