Scheduling Strategies - Affinity and Taints
Pod Scheduling Flow
How the Kubernetes scheduler picks a node (a configuration sketch follows the list):
1. Filtering: rule out nodes that do not satisfy the Pod's requirements
2. Scoring: rank the remaining nodes
3. Selection: bind the Pod to the highest-scoring node
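Filtering and scoring are implemented by scheduler plugins (NodeResourcesFit, NodeAffinity, TaintToleration, InterPodAffinity, and others). As a minimal sketch of how the two phases surface in configuration, a KubeSchedulerConfiguration can enable or disable plugins per extension point; disabling the balanced-allocation score plugin here is purely illustrative:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation    # illustrative: drop one scoring plugin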
NodeSelector - Simple Node Selection
The simplest way to steer a Pod onto particular nodes.
Labeling Nodes
# Add labels
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-2 disktype=hdd
kubectl label nodes node-3 gpu=nvidia
# View labels
kubectl get nodes --show-labels
# Remove a label
kubectl label nodes node-1 disktype-
Using NodeSelector
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd    # only schedule onto nodes labeled disktype=ssd
  containers:
  - name: nginx
    image: nginx
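To confirm where the Pod landed, the NODE column of the wide output shows the assigned node:
kubectl get pod nginx -o wide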
NodeAffinity - Node Affinity
A more expressive alternative to NodeSelector.
Hard Requirements
Conditions that must be satisfied, similar to NodeSelector (the Pod stays Pending if no node matches):
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx
Soft Preferences
Preferred but not mandatory; each matching term adds its weight (1-100) to the node's score:
apiVersion: v1
kind: Pod
metadata:
  name: with-node-preference
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - us-west-1a
  containers:
  - name: nginx
    image: nginx
Operators
- In: the label value is in the given list
- NotIn: the label value is not in the list
- Exists: the label key exists
- DoesNotExist: the label key does not exist
- Gt: the label value is greater than a number (a sketch follows the next snippet)
- Lt: the label value is less than a number
matchExpressions:
- key: node.kubernetes.io/instance-type
  operator: In
  values:
  - m5.large
  - m5.xlarge
- key: topology.kubernetes.io/zone    # replaces the deprecated failure-domain.beta.kubernetes.io/zone label
  operator: NotIn
  values:
  - us-west-1c
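Gt and Lt compare the label value as an integer. A minimal sketch, assuming a hypothetical cpu-count node label:
matchExpressions:
- key: cpu-count       # hypothetical label; the values list must hold a single integer string
  operator: Gt
  values:
  - "8"                # matches nodes where cpu-count > 8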
PodAffinity - Pod Affinity
Schedule a Pod based on where other Pods are already running.
Pod Affinity
Place Pods together:
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redis
        topologyKey: kubernetes.io/hostname
  containers:
  - name: nginx
    image: nginx
Meaning: schedule onto a node that is already running a Pod labeled app=redis (if no such Pod exists, this Pod stays Pending).
Pod Anti-Affinity
Spread Pods apart:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: webapp
            topologyKey: kubernetes.io/hostname
      containers:
      - name: webapp
        image: webapp:v1
Effect: at most one webapp Pod per node, which improves availability. Note that a required rule needs at least as many nodes as replicas; extra replicas stay Pending, so consider the softer variant sketched below.
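If spreading is preferred but every replica must still run somewhere, the requirement can be relaxed to a preference. A sketch of the same Deployment's affinity block:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: webapp
        topologyKey: kubernetes.io/hostname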
topologyKey
Defines the "topology domain" within which the rule applies. Common values:
- kubernetes.io/hostname: node level
- topology.kubernetes.io/zone: availability-zone level
- topology.kubernetes.io/region: region level
# Different nodes
topologyKey: kubernetes.io/hostname
# Different availability zones
topologyKey: topology.kubernetes.io/zone
# Different regions
topologyKey: topology.kubernetes.io/region
Taints and Tolerations
A taint makes a node repel Pods unless a Pod carries a matching toleration.
Adding Taints
# Add a taint
kubectl taint nodes node-1 key=value:NoSchedule
# View taints
kubectl describe node node-1 | grep Taints
# Remove a taint
kubectl taint nodes node-1 key:NoSchedule-
Taint Effects
NoSchedule: new Pods are not scheduled onto the node (existing Pods are unaffected)
kubectl taint nodes node-1 gpu=true:NoSchedule
PreferNoSchedule: the scheduler tries to avoid the node (a soft limit)
kubectl taint nodes node-1 gpu=true:PreferNoSchedule
NoExecute: evicts Pods already running on the node unless they tolerate the taint
kubectl taint nodes node-1 maintenance=true:NoExecute
Adding Tolerations
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu
Tolerating All Taints
tolerations:
- operator: "Exists"    # tolerates every taint
Tolerating for a Limited Time
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300    # evicted only after 300 seconds
Kubernetes injects tolerations like this (300 seconds for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable) into Pods by default via the DefaultTolerationSeconds admission plugin.
Practical Scenarios
Scenario 1: Dedicated GPU Nodes
# Label and taint the GPU node
kubectl label nodes gpu-node-1 gpu=nvidia
kubectl taint nodes gpu-node-1 gpu=nvidia:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      nodeSelector:
        gpu: nvidia
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "nvidia"
        effect: "NoSchedule"
      containers:
      - name: trainer
        image: tensorflow/tensorflow:latest-gpu
        resources:
          limits:
            nvidia.com/gpu: 1
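To verify that both replicas landed on GPU nodes, check the NODE column:
kubectl get pods -l app=ml-training -o wide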
Scenario 2: Highly Available Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      affinity:
        # Pod anti-affinity: spread across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: critical-app
            topologyKey: kubernetes.io/hostname
        # Node affinity: prefer SSD nodes
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
      containers:
      - name: app
        image: critical-app:v1
Scenario 3: Database and Application on the Same Node
# Redis
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
        cache: redis
    spec:
      containers:
      - name: redis
        image: redis:7.0
---
# Application (same node as Redis)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  cache: redis
              topologyKey: kubernetes.io/hostname
      containers:
      - name: webapp
        image: webapp:v1
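To verify co-location, list both Deployments' Pods together with their nodes (kubectl supports set-based label selectors):
kubectl get pods -o wide -l 'app in (redis,webapp)'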
Scenario 4: Node Maintenance
# Mark the node for maintenance
kubectl taint nodes node-1 maintenance=true:NoSchedule
# Evict existing Pods
kubectl taint nodes node-1 maintenance=true:NoExecute
# Drain the node (more graceful)
kubectl drain node-1 --ignore-daemonsets
# Bring the node back after maintenance
kubectl uncordon node-1
kubectl taint nodes node-1 maintenance:NoSchedule-
kubectl taint nodes node-1 maintenance:NoExecute-
PriorityClass - Priority
Defines Pod priority; when resources are tight, a high-priority Pod can preempt (evict) low-priority Pods.
Creating a PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High-priority applications"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 100
globalDefault: false
description: "Low-priority applications"
Using a PriorityClass
apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: critical-app:v1
Effects:
- When resources are scarce, lower-priority Pods may be preempted (evicted)
- Higher-priority Pods are scheduled first (a non-preempting variant is sketched below)
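Preemption can be disabled per class via the preemptionPolicy field; such Pods still sit ahead of lower priorities in the scheduling queue but never evict running Pods. A sketch:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never    # queue ahead of lower priorities, but never evict
description: "High priority without preemption"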
Custom Schedulers
To apply custom scheduling logic, point the Pod at a different scheduler by name:
apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduler-pod
spec:
  schedulerName: my-scheduler    # handled by the custom scheduler instead of the default
  containers:
  - name: nginx
    image: nginx
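A full custom scheduler is a separate binary, but kube-scheduler can also expose additional profiles under different names, which Pods then select via schedulerName. A minimal sketch (the per-profile plugin tweak is illustrative):
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: my-scheduler
  plugins:
    score:
      disabled:
      - name: TaintToleration    # illustrative per-profile tweak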
Common Commands
# Node labels
kubectl label nodes <node> key=value
kubectl get nodes --show-labels
# Taints
kubectl taint nodes <node> key=value:effect
kubectl describe node <node> | grep Taints
# Inspect Pod placement
kubectl get pods -o wide
kubectl describe pod <pod-name>
# Drain a node
kubectl drain <node> --ignore-daemonsets
kubectl uncordon <node>
# PriorityClass
kubectl get priorityclass
kubectl describe priorityclass <name>
Best Practices
1. Use NodeSelector for Simple Cases
Prefer NodeSelector when a single label match is enough:
nodeSelector:
  disktype: ssd
2. Use Anti-Affinity for High Availability
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: webapp
    topologyKey: kubernetes.io/hostname
3. Use Taints for Dedicated Nodes
kubectl taint nodes gpu-node gpu=true:NoSchedule
4. Set a Priority for Critical Applications
priorityClassName: high-priority
5. Use drain for Node Maintenance
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
6. Test Scheduling Policies
# Generate a test Pod manifest
kubectl run test --image=nginx --dry-run=client -o yaml > test.yaml
# Apply it after adding the scheduling constraints
kubectl apply -f test.yaml
# Inspect the scheduling result
kubectl get pod test -o wide
kubectl describe pod test
Summary
Kubernetes provides a rich set of scheduling strategies:
Node selection:
- NodeSelector: simple label matching
- NodeAffinity: flexible node selection
Pod placement:
- PodAffinity: co-locate Pods
- PodAntiAffinity: spread Pods apart
Node management:
- Taints: nodes repel Pods
- Tolerations: Pods tolerate taints
Priority:
- PriorityClass: preemptive scheduling
Use cases:
- Dedicated GPU nodes
- Highly available deployments
- Data locality
- Node maintenance
In the next chapter we will look at autoscaling with the HPA (Horizontal Pod Autoscaler).