EKS Cluster Creation and Configuration

Cluster Architecture Planning

Design Principles

Production-grade EKS cluster architecture:

Key architectural points:
├─ Multi-AZ deployment (at least 3 AZs)
├─ Worker nodes in private subnets
├─ NAT Gateways and load balancers in public subnets
├─ VPC Endpoints to cut NAT data-transfer costs
├─ Deliberate IP address planning
└─ Strict network isolation

High-availability design:
├─ Control plane: multi-AZ, managed automatically by AWS
├─ Worker nodes: spread evenly across 3 AZs
├─ NAT Gateway: one per AZ
├─ Load balancers: multi-AZ
└─ Data persistence: EBS volumes are replicated only within a single AZ, so cross-AZ durability needs EBS snapshots or a multi-AZ service such as EFS

VPC and Subnet Planning

Recommended VPC design:

VPC CIDR: 10.0.0.0/16

Public subnets (for NAT Gateways, ALB/NLB):
├─ us-east-1a: 10.0.1.0/24  (251 usable IPs)
├─ us-east-1b: 10.0.2.0/24  (251 usable IPs)
└─ us-east-1c: 10.0.3.0/24  (251 usable IPs)

Private subnets (for EKS nodes and Pods):
├─ us-east-1a: 10.0.11.0/24 (251 usable IPs)
├─ us-east-1b: 10.0.12.0/24 (251 usable IPs)
└─ us-east-1c: 10.0.13.0/24 (251 usable IPs)
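Before creating anything, the plan can be sanity-checked offline. A small pure-bash sketch (no AWS calls; the CIDR list is hard-coded from the table above) verifies that every planned subnet falls inside the VPC CIDR:

```shell
#!/bin/bash
# cidr-plan-check.sh -- offline sanity check of the subnet plan above
set -e

# Convert a dotted-quad IPv4 address to a 32-bit integer
ip_to_int() {
  local IFS=.
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_cidr <subnet-cidr> <network-int> <prefix-len>: succeeds when the
# subnet's network address lies inside the given network
in_cidr() {
  local ip=${1%/*}
  local mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq "$2" ]
}

VPC_NET=$(ip_to_int 10.0.0.0)   # VPC CIDR 10.0.0.0/16

for cidr in 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24 \
            10.0.11.0/24 10.0.12.0/24 10.0.13.0/24; do
  if in_cidr "$cidr" "$VPC_NET" 16; then
    echo "$cidr inside 10.0.0.0/16: OK"
  else
    echo "$cidr is OUTSIDE the VPC CIDR" >&2
    exit 1
  fi
done
```

The same check is worth extending with an overlap test if you later add secondary CIDR ranges to the VPC.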

Why /24 subnets:
├─ 251 usable IPs per subnet (AWS reserves 5)
├─ A t3.large supports at most 35 Pods with the VPC CNI
├─ Assuming 5 nodes per AZ: 5 × 35 = 175 Pod IPs
├─ Plus the node IPs themselves: 175 + 5 = 180 IPs
├─ A /24 is sufficient, leaving roughly 28% headroom
└─ Resize to /23 or /22 if your density is higher
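The sizing arithmetic above follows the VPC CNI max-pods formula, maxPods = ENIs × (IPs per ENI − 1) + 2, where the +2 accounts for host-network pods such as aws-node and kube-proxy. A quick reproduction in shell; the ENI limits are an assumption for t3.large taken from the EC2 ENI limits table:

```shell
#!/bin/bash
# pod-capacity-check.sh -- reproduce the /24 sizing arithmetic offline
set -e

# ENI limits for t3.large (assumed from the EC2 ENI limits table)
ENIS=3
IPS_PER_ENI=12

# VPC CNI: the primary IP of each ENI is not handed to pods, and 2 extra
# pods (aws-node, kube-proxy) run in host-network mode on the node IP
MAX_PODS=$(( ENIS * (IPS_PER_ENI - 1) + 2 ))

NODES_PER_AZ=5
TOTAL_IPS=$(( NODES_PER_AZ * MAX_PODS + NODES_PER_AZ ))  # pod IPs + node IPs

USABLE=$(( 256 - 5 ))   # a /24 minus the 5 addresses AWS reserves
HEADROOM=$(( (USABLE - TOTAL_IPS) * 100 / USABLE ))

echo "max pods per t3.large node: $MAX_PODS"
echo "IPs needed per AZ: $TOTAL_IPS of $USABLE (headroom ${HEADROOM}%)"
```

Swapping in the ENI limits for another instance type gives its sizing at a glance; this slightly overcounts, since the two host-network pods do not actually consume subnet IPs, which makes the estimate conservative.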

Hands-On: Creating the VPC and Subnets

Creating the VPC Network

#!/bin/bash
# create-eks-vpc.sh

set -e

REGION="us-east-1"
VPC_CIDR="10.0.0.0/16"
CLUSTER_NAME="production-eks-cluster"

echo "================================================"
echo "Creating EKS VPC and subnets"
echo "================================================"

# 1. Create the VPC
echo "1. Creating VPC..."
VPC_ID=$(aws ec2 create-vpc \
  --cidr-block $VPC_CIDR \
  --tag-specifications "ResourceType=vpc,Tags=[
    {Key=Name,Value=eks-vpc},
    {Key=kubernetes.io/cluster/$CLUSTER_NAME,Value=shared}
  ]" \
  --region $REGION \
  --query 'Vpc.VpcId' \
  --output text)

# create-vpc has no DNS flags; enable the DNS attributes afterwards,
# one modify-vpc-attribute call per attribute
aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-support \
  --region $REGION

aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-hostnames \
  --region $REGION

echo "   ✓ VPC ID: $VPC_ID"

# 2. Create the Internet Gateway
echo ""
echo "2. Creating Internet Gateway..."
IGW_ID=$(aws ec2 create-internet-gateway \
  --tag-specifications "ResourceType=internet-gateway,Tags=[
    {Key=Name,Value=eks-igw}
  ]" \
  --region $REGION \
  --query 'InternetGateway.InternetGatewayId' \
  --output text)

aws ec2 attach-internet-gateway \
  --vpc-id $VPC_ID \
  --internet-gateway-id $IGW_ID \
  --region $REGION

echo "   ✓ IGW ID: $IGW_ID"

# 3. Create the public subnets
echo ""
echo "3. Creating public subnets..."
declare -a PUBLIC_SUBNET_IDS

AZS=("us-east-1a" "us-east-1b" "us-east-1c")
PUBLIC_CIDRS=("10.0.1.0/24" "10.0.2.0/24" "10.0.3.0/24")

for i in {0..2}; do
  SUBNET_ID=$(aws ec2 create-subnet \
    --vpc-id $VPC_ID \
    --cidr-block ${PUBLIC_CIDRS[$i]} \
    --availability-zone ${AZS[$i]} \
    --tag-specifications "ResourceType=subnet,Tags=[
      {Key=Name,Value=eks-public-${AZS[$i]}},
      {Key=kubernetes.io/role/elb,Value=1},
      {Key=kubernetes.io/cluster/$CLUSTER_NAME,Value=shared}
    ]" \
    --region $REGION \
    --query 'Subnet.SubnetId' \
    --output text)

  # Auto-assign public IPs at launch
  aws ec2 modify-subnet-attribute \
    --subnet-id $SUBNET_ID \
    --map-public-ip-on-launch \
    --region $REGION

  PUBLIC_SUBNET_IDS[$i]=$SUBNET_ID
  echo "   ✓ Public subnet ${AZS[$i]}: $SUBNET_ID"
done

# 4. Create the private subnets
echo ""
echo "4. Creating private subnets..."
declare -a PRIVATE_SUBNET_IDS

PRIVATE_CIDRS=("10.0.11.0/24" "10.0.12.0/24" "10.0.13.0/24")

for i in {0..2}; do
  SUBNET_ID=$(aws ec2 create-subnet \
    --vpc-id $VPC_ID \
    --cidr-block ${PRIVATE_CIDRS[$i]} \
    --availability-zone ${AZS[$i]} \
    --tag-specifications "ResourceType=subnet,Tags=[
      {Key=Name,Value=eks-private-${AZS[$i]}},
      {Key=kubernetes.io/role/internal-elb,Value=1},
      {Key=kubernetes.io/cluster/$CLUSTER_NAME,Value=shared}
    ]" \
    --region $REGION \
    --query 'Subnet.SubnetId' \
    --output text)

  PRIVATE_SUBNET_IDS[$i]=$SUBNET_ID
  echo "   ✓ Private subnet ${AZS[$i]}: $SUBNET_ID"
done

# 5. Create the public route table
echo ""
echo "5. Creating public route table..."
PUBLIC_RTB_ID=$(aws ec2 create-route-table \
  --vpc-id $VPC_ID \
  --tag-specifications "ResourceType=route-table,Tags=[
    {Key=Name,Value=eks-public-rtb}
  ]" \
  --region $REGION \
  --query 'RouteTable.RouteTableId' \
  --output text)

# Default route via the Internet Gateway
aws ec2 create-route \
  --route-table-id $PUBLIC_RTB_ID \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id $IGW_ID \
  --region $REGION

# Associate the public subnets
for subnet_id in "${PUBLIC_SUBNET_IDS[@]}"; do
  aws ec2 associate-route-table \
    --route-table-id $PUBLIC_RTB_ID \
    --subnet-id $subnet_id \
    --region $REGION
done

echo "   ✓ Public route table: $PUBLIC_RTB_ID"

# 6. Create the NAT Gateways (one per AZ)
echo ""
echo "6. Creating NAT Gateways..."
declare -a NAT_GW_IDS

for i in {0..2}; do
  # Allocate an Elastic IP
  EIP_ALLOC_ID=$(aws ec2 allocate-address \
    --domain vpc \
    --tag-specifications "ResourceType=elastic-ip,Tags=[
      {Key=Name,Value=eks-nat-${AZS[$i]}}
    ]" \
    --region $REGION \
    --query 'AllocationId' \
    --output text)

  # Create the NAT Gateway
  NAT_GW_ID=$(aws ec2 create-nat-gateway \
    --subnet-id ${PUBLIC_SUBNET_IDS[$i]} \
    --allocation-id $EIP_ALLOC_ID \
    --tag-specifications "ResourceType=natgateway,Tags=[
      {Key=Name,Value=eks-nat-${AZS[$i]}}
    ]" \
    --region $REGION \
    --query 'NatGateway.NatGatewayId' \
    --output text)

  NAT_GW_IDS[$i]=$NAT_GW_ID
  echo "   ✓ NAT Gateway ${AZS[$i]}: $NAT_GW_ID"
done

# Wait until all NAT Gateways are available
echo ""
echo "   Waiting for NAT Gateways to become available..."
for nat_gw_id in "${NAT_GW_IDS[@]}"; do
  aws ec2 wait nat-gateway-available \
    --nat-gateway-ids $nat_gw_id \
    --region $REGION &
done
wait
echo "   ✓ All NAT Gateways are ready"

# 7. Create the private route tables (one per AZ)
echo ""
echo "7. Creating private route tables..."
for i in {0..2}; do
  PRIVATE_RTB_ID=$(aws ec2 create-route-table \
    --vpc-id $VPC_ID \
    --tag-specifications "ResourceType=route-table,Tags=[
      {Key=Name,Value=eks-private-rtb-${AZS[$i]}}
    ]" \
    --region $REGION \
    --query 'RouteTable.RouteTableId' \
    --output text)

  # Default route via this AZ's NAT Gateway
  aws ec2 create-route \
    --route-table-id $PRIVATE_RTB_ID \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id ${NAT_GW_IDS[$i]} \
    --region $REGION

  # Associate the matching private subnet
  aws ec2 associate-route-table \
    --route-table-id $PRIVATE_RTB_ID \
    --subnet-id ${PRIVATE_SUBNET_IDS[$i]} \
    --region $REGION

  echo "   ✓ Private route table ${AZS[$i]}: $PRIVATE_RTB_ID"
done

# Persist the IDs for the follow-up scripts
cat > eks-vpc-config.env << EOF
VPC_ID=$VPC_ID
IGW_ID=$IGW_ID
PUBLIC_SUBNET_IDS=(${PUBLIC_SUBNET_IDS[@]})
PRIVATE_SUBNET_IDS=(${PRIVATE_SUBNET_IDS[@]})
NAT_GW_IDS=(${NAT_GW_IDS[@]})
CLUSTER_NAME=$CLUSTER_NAME
REGION=$REGION
EOF

echo ""
echo "================================================"
echo "VPC creation complete!"
echo "================================================"
echo "VPC ID: $VPC_ID"
echo ""
echo "Public subnets:"
for i in {0..2}; do
  echo "  ${AZS[$i]}: ${PUBLIC_SUBNET_IDS[$i]}"
done
echo ""
echo "Private subnets:"
for i in {0..2}; do
  echo "  ${AZS[$i]}: ${PRIVATE_SUBNET_IDS[$i]}"
done
echo ""
echo "NAT Gateways:"
for i in {0..2}; do
  echo "  ${AZS[$i]}: ${NAT_GW_IDS[$i]}"
done
echo "================================================"
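The follow-up scripts reuse these resource IDs with `source eks-vpc-config.env`. Because the arrays are written out in bash array syntax, a plain `source` restores them as real arrays rather than single strings. A self-contained demonstration of the pattern, using made-up placeholder IDs (vpc-0abc1234, subnet-aaa, etc. are hypothetical):

```shell
#!/bin/bash
# config-roundtrip-demo.sh -- illustrates the eks-vpc-config.env handoff
# with placeholder resource IDs (all values below are made up)
set -e

cat > /tmp/demo-vpc-config.env << 'EOF'
VPC_ID=vpc-0abc1234
PUBLIC_SUBNET_IDS=(subnet-aaa subnet-bbb subnet-ccc)
PRIVATE_SUBNET_IDS=(subnet-ddd subnet-eee subnet-fff)
CLUSTER_NAME=production-eks-cluster
REGION=us-east-1
EOF

# Sourcing the file restores scalars and arrays alike
source /tmp/demo-vpc-config.env

echo "VPC:             $VPC_ID"
echo "private subnets: ${#PRIVATE_SUBNET_IDS[@]}"
echo "second private:  ${PRIVATE_SUBNET_IDS[1]}"
```

One caveat of this pattern: the file is only valid for bash-compatible shells, so keep all of the scripts in this chapter on `#!/bin/bash`.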

IAM Role Configuration

EKS Cluster Role

#!/bin/bash
# create-eks-cluster-role.sh

set -e

echo "================================================"
echo "Creating the EKS cluster IAM role"
echo "================================================"

# 1. Write the trust policy
cat > eks-cluster-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "eks.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# 2. Create the IAM role (or reuse it if it already exists)
EKS_CLUSTER_ROLE_ARN=$(aws iam create-role \
  --role-name eks-cluster-role \
  --assume-role-policy-document file://eks-cluster-trust-policy.json \
  --description "IAM role for EKS cluster" \
  --tags Key=Purpose,Value=EKS \
  --query 'Role.Arn' \
  --output text 2>/dev/null || \
  aws iam get-role --role-name eks-cluster-role --query 'Role.Arn' --output text)

echo "   ✓ Cluster Role ARN: $EKS_CLUSTER_ROLE_ARN"

# 3. Attach the required policies
echo "   Attaching policies..."
aws iam attach-role-policy \
  --role-name eks-cluster-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

aws iam attach-role-policy \
  --role-name eks-cluster-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSVPCResourceController

rm -f eks-cluster-trust-policy.json

# Persist to the config file
echo "EKS_CLUSTER_ROLE_ARN=$EKS_CLUSTER_ROLE_ARN" >> eks-vpc-config.env

echo ""
echo "================================================"
echo "EKS cluster role created!"
echo "Role ARN: $EKS_CLUSTER_ROLE_ARN"
echo "================================================"

EKS Node Role

#!/bin/bash
# create-eks-node-role.sh

set -e

echo "================================================"
echo "Creating the EKS node IAM role"
echo "================================================"

# 1. Write the trust policy
cat > eks-node-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# 2. Create the IAM role (or reuse it if it already exists)
EKS_NODE_ROLE_ARN=$(aws iam create-role \
  --role-name eks-node-role \
  --assume-role-policy-document file://eks-node-trust-policy.json \
  --description "IAM role for EKS worker nodes" \
  --tags Key=Purpose,Value=EKS \
  --query 'Role.Arn' \
  --output text 2>/dev/null || \
  aws iam get-role --role-name eks-node-role --query 'Role.Arn' --output text)

echo "   ✓ Node Role ARN: $EKS_NODE_ROLE_ARN"

# 3. Attach the required policies
echo "   Attaching policies..."
aws iam attach-role-policy \
  --role-name eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

aws iam attach-role-policy \
  --role-name eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

aws iam attach-role-policy \
  --role-name eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

aws iam attach-role-policy \
  --role-name eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

rm -f eks-node-trust-policy.json

# Persist to the config file
echo "EKS_NODE_ROLE_ARN=$EKS_NODE_ROLE_ARN" >> eks-vpc-config.env

echo ""
echo "================================================"
echo "EKS node role created!"
echo "Role ARN: $EKS_NODE_ROLE_ARN"
echo "================================================"
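The two role scripts are identical except for the service principal in the trust policy (eks.amazonaws.com for the control plane, ec2.amazonaws.com for the nodes). A hypothetical helper that factors out that shared pattern and validates the generated JSON offline; it makes no AWS calls, and python3 is assumed to be available for the JSON check:

```shell
#!/bin/bash
# make-trust-policy.sh -- hypothetical helper: generate a trust policy
# for a given AWS service principal and sanity-check it offline
set -e

make_trust_policy() {
  local service="$1"   # e.g. eks.amazonaws.com or ec2.amazonaws.com
  cat << EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "$service" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
}

make_trust_policy eks.amazonaws.com > /tmp/cluster-trust.json
make_trust_policy ec2.amazonaws.com > /tmp/node-trust.json

# Offline sanity check: both documents must parse as JSON
python3 -m json.tool /tmp/cluster-trust.json > /dev/null
python3 -m json.tool /tmp/node-trust.json > /dev/null
echo "trust policies OK"
```

Validating the document before calling `aws iam create-role` turns a confusing MalformedPolicyDocument error into an immediate local failure.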

Security Group Configuration

Creating the Cluster Security Groups

#!/bin/bash
# create-eks-security-groups.sh

set -e

source eks-vpc-config.env

echo "================================================"
echo "Creating EKS security groups"
echo "================================================"

# 1. Cluster security group (control plane)
echo "1. Creating cluster security group..."
CLUSTER_SG=$(aws ec2 create-security-group \
  --group-name eks-cluster-sg \
  --description "Security group for EKS cluster control plane" \
  --vpc-id $VPC_ID \
  --tag-specifications "ResourceType=security-group,Tags=[
    {Key=Name,Value=eks-cluster-sg}
  ]" \
  --region $REGION \
  --query 'GroupId' \
  --output text)

echo "   ✓ Cluster security group: $CLUSTER_SG"

# 2. Node security group
echo ""
echo "2. Creating node security group..."
NODE_SG=$(aws ec2 create-security-group \
  --group-name eks-node-sg \
  --description "Security group for EKS worker nodes" \
  --vpc-id $VPC_ID \
  --tag-specifications "ResourceType=security-group,Tags=[
    {Key=Name,Value=eks-node-sg},
    {Key=kubernetes.io/cluster/$CLUSTER_NAME,Value=owned}
  ]" \
  --region $REGION \
  --query 'GroupId' \
  --output text)

echo "   ✓ Node security group: $NODE_SG"

# 3. Configure the security group rules
echo ""
echo "3. Configuring security group rules..."

# Node-to-node traffic (all protocols)
aws ec2 authorize-security-group-ingress \
  --group-id $NODE_SG \
  --source-group $NODE_SG \
  --protocol -1 \
  --region $REGION

echo "   ✓ Node-to-node rule"

# Nodes to the cluster API (HTTPS)
aws ec2 authorize-security-group-ingress \
  --group-id $CLUSTER_SG \
  --source-group $NODE_SG \
  --protocol tcp \
  --port 443 \
  --region $REGION

echo "   ✓ Node-to-cluster-API rule"

# Cluster to nodes (kubelet API)
aws ec2 authorize-security-group-ingress \
  --group-id $NODE_SG \
  --source-group $CLUSTER_SG \
  --protocol tcp \
  --port 10250 \
  --region $REGION

echo "   ✓ Cluster-to-kubelet rule"

# Cluster to nodes (extension API servers)
aws ec2 authorize-security-group-ingress \
  --group-id $NODE_SG \
  --source-group $CLUSTER_SG \
  --protocol tcp \
  --port 443 \
  --region $REGION

echo "   ✓ Cluster-to-extension-API rule"

# Persist the IDs
cat >> eks-vpc-config.env << EOF
CLUSTER_SG=$CLUSTER_SG
NODE_SG=$NODE_SG
EOF

echo ""
echo "================================================"
echo "Security groups created!"
echo "Cluster security group: $CLUSTER_SG"
echo "Node security group: $NODE_SG"
echo "================================================"

Creating the EKS Cluster

Cluster Creation Script

#!/bin/bash
# create-eks-cluster.sh

set -e

source eks-vpc-config.env

K8S_VERSION="1.28"

echo "================================================"
echo "Creating the EKS cluster"
echo "================================================"

# 1. Create the cluster
# Note: publicAccessCidrs=0.0.0.0/0 leaves the public endpoint open to the
# whole internet; tighten it to known CIDRs for production
echo "1. Creating the EKS control plane..."
aws eks create-cluster \
  --name $CLUSTER_NAME \
  --role-arn $EKS_CLUSTER_ROLE_ARN \
  --resources-vpc-config \
    subnetIds=${PRIVATE_SUBNET_IDS[0]},${PRIVATE_SUBNET_IDS[1]},${PRIVATE_SUBNET_IDS[2]},\
securityGroupIds=$CLUSTER_SG,\
endpointPublicAccess=true,\
endpointPrivateAccess=true,\
publicAccessCidrs="0.0.0.0/0" \
  --kubernetes-version $K8S_VERSION \
  --logging '{"clusterLogging":[
    {"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}
  ]}' \
  --tags \
    Environment=production \
    Project=eks-cluster \
  --region $REGION

echo "   ✓ Cluster creation request submitted"
echo "   ⏳ Waiting for the cluster to become active (about 10-15 minutes)..."

# 2. Wait for the cluster
aws eks wait cluster-active \
  --name $CLUSTER_NAME \
  --region $REGION

# 3. Fetch cluster details
CLUSTER_ENDPOINT=$(aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.endpoint' \
  --output text)

echo "   ✓ Cluster is active"
echo "   Endpoint: $CLUSTER_ENDPOINT"

# 4. Configure kubectl
echo ""
echo "2. Configuring kubectl..."
aws eks update-kubeconfig \
  --name $CLUSTER_NAME \
  --region $REGION

echo "   ✓ kubeconfig updated"

# 5. Verify the cluster
echo ""
echo "3. Verifying cluster connectivity..."
kubectl cluster-info
kubectl get svc

# 6. Create the OIDC provider (for IRSA)
echo ""
echo "4. Creating the OIDC identity provider..."
OIDC_ID=$(aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.identity.oidc.issuer' \
  --output text | sed 's|https://||')

# Skip creation if it already exists
EXISTING_OIDC=$(aws iam list-open-id-connect-providers \
  --query "OpenIDConnectProviderList[?contains(Arn, '$OIDC_ID')].Arn" \
  --output text)

if [ -z "$EXISTING_OIDC" ]; then
  # Create the OIDC provider
  eksctl utils associate-iam-oidc-provider \
    --cluster $CLUSTER_NAME \
    --region $REGION \
    --approve

  echo "   ✓ OIDC provider created"
else
  echo "   ✓ OIDC provider already exists"
fi

# Persist the OIDC ID
echo "OIDC_ID=$OIDC_ID" >> eks-vpc-config.env

echo ""
echo "================================================"
echo "EKS cluster created!"
echo "================================================"
echo "Cluster name: $CLUSTER_NAME"
echo "Cluster endpoint: $CLUSTER_ENDPOINT"
echo "Kubernetes version: $K8S_VERSION"
echo "OIDC provider: $OIDC_ID"
echo ""
echo "Next steps:"
echo "  1. Create the node groups (run create-node-groups.sh)"
echo "  2. Install the core add-ons (run install-addons.sh)"
echo "  3. Deploy your applications"
echo "================================================"
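As an alternative to the imperative script above, the same cluster can be described declaratively for eksctl (the tool already used for the OIDC step). A sketch that generates an approximately equivalent ClusterConfig as a heredoc; the field names follow the eksctl schema as documented, and subnet-aaa/bbb/ccc are placeholders to be filled in from eks-vpc-config.env:

```shell
#!/bin/bash
# generate-eksctl-config.sh -- declarative sketch of the same cluster;
# subnet-aaa/bbb/ccc are placeholders for the real private subnet IDs
set -e

cat > cluster.yaml << 'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: production-eks-cluster
  region: us-east-1
  version: "1.28"

vpc:
  subnets:
    private:
      us-east-1a: { id: subnet-aaa }
      us-east-1b: { id: subnet-bbb }
      us-east-1c: { id: subnet-ccc }
  clusterEndpoints:
    publicAccess: true
    privateAccess: true

cloudWatch:
  clusterLogging:
    enableTypes: ["api", "audit", "authenticator", "controllerManager", "scheduler"]

iam:
  withOIDC: true

EOF

echo "wrote cluster.yaml -- review it, then run: eksctl create cluster -f cluster.yaml"
```

The declarative form trades fine-grained control (custom route tables, per-AZ NAT) for repeatability; pick one approach per environment rather than mixing them.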

Cluster Endpoint Access Configuration

Access Mode Comparison

Three endpoint access modes:

1. Public + private endpoint (recommended for production)
   Pros:
   ├─ Developers can connect from the office/VPN
   ├─ CI/CD pipelines can reach the API directly
   ├─ In-VPC nodes use the private endpoint with low latency
   └─ Most flexible
   
   Configuration:
   endpointPublicAccess=true
   endpointPrivateAccess=true
   publicAccessCidrs=["<office IP>/32", "<VPN CIDR>/24"]

2. Private endpoint only (most secure)
   Pros:
   ├─ Highest security
   ├─ API server fully isolated
   └─ No public exposure
   
   Limitations:
   ├─ Access requires a VPN or bastion host
   ├─ CI/CD must run inside the VPC
   └─ Higher operational complexity
   
   Configuration:
   endpointPublicAccess=false
   endpointPrivateAccess=true

3. Public endpoint only (not recommended for production)
   Suitable for:
   ├─ Dev/test environments
   └─ Quick experiments
   
   Risks:
   ├─ API server exposed to the internet
   ├─ Nodes reach the control plane over the public network
   └─ Higher latency
   
   Configuration:
   endpointPublicAccess=true
   endpointPrivateAccess=false

Restricting Public Endpoint Access

#!/bin/bash
# restrict-api-access.sh

set -e

source eks-vpc-config.env

# Restrict the public API endpoint to known CIDR ranges
aws eks update-cluster-config \
  --name $CLUSTER_NAME \
  --resources-vpc-config \
    endpointPublicAccess=true,\
endpointPrivateAccess=true,\
publicAccessCidrs="203.0.113.0/24","198.51.100.0/24" \
  --region $REGION

echo "API endpoint access restricted to the specified IP ranges"

Cluster Verification

Verifying Cluster Status

#!/bin/bash
# verify-cluster.sh

set -e

source eks-vpc-config.env

echo "================================================"
echo "Verifying the EKS cluster"
echo "================================================"

# 1. Cluster status
echo "1. Cluster status:"
aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.status' \
  --output text

# 2. Control plane logging configuration
echo ""
echo "2. Control plane log configuration:"
aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.logging.clusterLogging[0]'

# 3. Network configuration
echo ""
echo "3. Network configuration:"
aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.resourcesVpcConfig' \
  --output json

# 4. kubectl checks
echo ""
echo "4. Kubernetes component status:"
# componentstatuses is deprecated and not meaningful on a managed control
# plane; check the nodes and system pods instead
kubectl get nodes
kubectl get pods -n kube-system

echo ""
echo "================================================"
echo "Cluster verification complete!"
echo "================================================"
The cluster is up — next, create the node groups!