EKS Cluster Creation and Configuration

This chapter walks through creating and configuring a production Amazon EKS cluster: IAM role setup, cluster creation, node group management, add-on installation, and monitoring configuration.

EKS Architecture Design

Cluster Planning

Cluster configuration overview:

Cluster name: production-eks-cluster
Kubernetes version: 1.28
Region: us-east-1
Availability Zones: us-east-1a, us-east-1b, us-east-1c

Network configuration:
├─ VPC: 10.0.0.0/16
├─ Control plane subnets: private application subnets (3 AZs)
├─ Worker node subnets: private application subnets (3 AZs)
└─ Pod networking: VPC CNI (optional secondary CIDR)

Endpoint access:
├─ Public endpoint: enabled (IP-restricted)
├─ Private endpoint: enabled
└─ Hybrid access mode (recommended)

Control plane logs enabled:
├─ API Server
├─ Audit
├─ Authenticator
├─ Controller Manager
└─ Scheduler

Control Plane and Data Plane

Architecture diagram:

┌─────────────────────────────────────────────────────┐
│              AWS-managed control plane              │
│  ┌──────────────────────────────────────────────┐   │
│  │  API Server (Multi-AZ)                       │   │
│  │  etcd (Multi-AZ, automatic backups)          │   │
│  │  Controller Manager                          │   │
│  │  Scheduler                                   │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────┬───────────────────────────────┘
                      │ (security group: sg-eks-cp)
                      │ endpoints: public + private
                      ▼
┌─────────────────────────────────────────────────────┐
│              Worker nodes (data plane)              │
│                                                     │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐     │
│  │ us-east-1a │  │ us-east-1b │  │ us-east-1c │     │
│  │            │  │            │  │            │     │
│  │  Node 1    │  │  Node 3    │  │  Node 5    │     │
│  │  Node 2    │  │  Node 4    │  │  Node 6    │     │
│  │            │  │            │  │            │     │
│  │  Pods...   │  │  Pods...   │  │  Pods...   │     │
│  └────────────┘  └────────────┘  └────────────┘     │
│                                                     │
│  (security group: sg-eks-nodes)                     │
│  (IAM role: eks-node-role)                          │
└─────────────────────────────────────────────────────┘

Networking Model

VPC CNI plugin:

How it works:
├─ Every Pod gets a real IP address from the VPC
├─ Pods communicate directly with VPC resources
├─ No extra NAT or overlay network
└─ Best possible network performance

IP address management:
├─ Primary ENI: the node's primary IP
├─ Secondary ENIs: Pod IP pool
├─ ENI and IP limits differ per instance type
└─ Subnet sizing must be planned carefully

Example (m5.xlarge):
├─ Max ENIs: 4
├─ IPv4 addresses per ENI: 15
├─ Max Pods: 4 × (15 − 1) + 2 = 58
│   (each ENI's primary IP is reserved; the +2 covers
│    host-network pods such as aws-node and kube-proxy)
└─ Practical sizing: leave 10-20% headroom
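The per-instance Pod capacity follows the AWS formula max_pods = ENIs × (IPv4 addresses per ENI − 1) + 2. A small helper sketch for sizing other instance types (the ENI/IP limits you pass in come from AWS's published per-instance-type limits):

```shell
# Sketch: compute the default EKS max-pods value from an instance type's
# ENI and per-ENI IPv4 limits:
#   max_pods = ENIs * (IPs per ENI - 1) + 2
# Each ENI's primary IP is reserved; the +2 accounts for host-network
# pods (aws-node, kube-proxy) that do not consume a VPC secondary IP.
eks_max_pods() {
  local enis=$1 ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

eks_max_pods 4 15   # m5.xlarge (4 ENIs x 15 IPs) -> 58
eks_max_pods 3 6    # t3.medium (3 ENIs x 6 IPs)  -> 17
```

These match the values shipped in AWS's eni-max-pods.txt for those instance types.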

IAM Role Preparation

EKS Cluster Role

Purpose: allows the EKS control plane to call AWS APIs on your behalf

Trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Creation script:

#!/bin/bash
# create-eks-cluster-role.sh

ROLE_NAME="eks-cluster-role"

echo "Creating the EKS cluster IAM role..."

# Write the trust policy file
cat > eks-cluster-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role (IAM is a global service, so no --region is needed)
aws iam create-role \
  --role-name $ROLE_NAME \
  --assume-role-policy-document file://eks-cluster-trust-policy.json \
  --description "IAM role for EKS cluster"

# Attach the required managed policy
echo "Attaching AmazonEKSClusterPolicy..."
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

# Attach the VPC resource controller policy
echo "Attaching AmazonEKSVPCResourceController..."
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSVPCResourceController

# Fetch the role ARN
CLUSTER_ROLE_ARN=$(aws iam get-role \
  --role-name $ROLE_NAME \
  --query 'Role.Arn' \
  --output text)

echo "EKS cluster role ARN: $CLUSTER_ROLE_ARN"
echo "export EKS_CLUSTER_ROLE_ARN=$CLUSTER_ROLE_ARN" >> eks-config.sh

# Clean up temporary files
rm -f eks-cluster-trust-policy.json

echo "✓ EKS cluster role created"

EKS Node Role

Purpose: allows worker nodes (EC2 instances) to call AWS APIs

Trust policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Creation script:

#!/bin/bash
# create-eks-node-role.sh

ROLE_NAME="eks-node-role"

echo "Creating the EKS node IAM role..."

# Write the trust policy file
cat > eks-node-trust-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role (IAM is a global service, so no --region is needed)
aws iam create-role \
  --role-name $ROLE_NAME \
  --assume-role-policy-document file://eks-node-trust-policy.json \
  --description "IAM role for EKS worker nodes"

# Attach the required managed policies

# 1. Core worker node policy
echo "Attaching AmazonEKSWorkerNodePolicy..."
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

# 2. CNI policy (Pod networking)
echo "Attaching AmazonEKS_CNI_Policy..."
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

# 3. ECR read-only policy (image pulls)
echo "Attaching AmazonEC2ContainerRegistryReadOnly..."
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

# 4. SSM policy (remote management)
echo "Attaching AmazonSSMManagedInstanceCore..."
aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

# Create a custom inline policy (CloudWatch Logs)
echo "Creating the CloudWatch Logs policy..."
cat > eks-node-cloudwatch-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name $ROLE_NAME \
  --policy-name EKSNodeCloudWatchPolicy \
  --policy-document file://eks-node-cloudwatch-policy.json

# Fetch the role ARN
NODE_ROLE_ARN=$(aws iam get-role \
  --role-name $ROLE_NAME \
  --query 'Role.Arn' \
  --output text)

echo "EKS node role ARN: $NODE_ROLE_ARN"
echo "export EKS_NODE_ROLE_ARN=$NODE_ROLE_ARN" >> eks-config.sh

# Clean up temporary files
rm -f eks-node-trust-policy.json eks-node-cloudwatch-policy.json

echo "✓ EKS node role created"

EKS Cluster Creation

Cluster creation script

#!/bin/bash
# create-eks-cluster.sh

set -e

# Load configuration from the earlier steps
source vpc-config.sh
source sg-config.sh
source eks-config.sh

REGION="us-east-1"
CLUSTER_NAME="production-eks-cluster"
K8S_VERSION="1.28"
OFFICE_IP="203.0.113.0/24"  # replace with your actual office CIDR

echo "================================================"
echo "Creating EKS cluster: $CLUSTER_NAME"
echo "Version: $K8S_VERSION"
echo "Region: $REGION"
echo "================================================"

# 1. Create the cluster
echo ""
echo "1. Creating the EKS cluster (expect 10-15 minutes)..."

# Note: shorthand values (--resources-vpc-config, --tags) must each stay
# on a single line; a backslash continuation followed by indentation
# splits them into separate arguments and breaks parsing.
aws eks create-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --kubernetes-version $K8S_VERSION \
  --role-arn $EKS_CLUSTER_ROLE_ARN \
  --resources-vpc-config "subnetIds=$PRIVATE_APP_SUBNET_1A,$PRIVATE_APP_SUBNET_1B,$PRIVATE_APP_SUBNET_1C,securityGroupIds=$EKS_CP_SG_ID,endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs=$OFFICE_IP" \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}' \
  --tags Environment=production,ManagedBy=script,Team=platform

echo "   Cluster creation request submitted"

# 2. Wait for the cluster to become ACTIVE
echo ""
echo "2. Waiting for the cluster to reach the ACTIVE state..."
aws eks wait cluster-active \
  --name $CLUSTER_NAME \
  --region $REGION

echo "   ✓ Cluster is active"

# 3. Update kubeconfig
echo ""
echo "3. Updating kubeconfig..."
aws eks update-kubeconfig \
  --name $CLUSTER_NAME \
  --region $REGION

echo "   ✓ kubeconfig updated"

# 4. Verify connectivity
echo ""
echo "4. Verifying cluster connectivity..."
kubectl cluster-info
kubectl get svc

# 5. Create the OIDC provider (for IRSA)
echo ""
echo "5. Creating the OIDC identity provider..."

# Get the OIDC issuer URL
OIDC_ISSUER=$(aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.identity.oidc.issuer' \
  --output text)

echo "   OIDC issuer: $OIDC_ISSUER"

# Extract the OIDC ID (last path segment of the issuer URL)
OIDC_ID=$(echo $OIDC_ISSUER | cut -d '/' -f 5)
echo "   OIDC ID: $OIDC_ID"

# Skip creation if a provider for this cluster already exists
EXISTING_PROVIDER=$(aws iam list-open-id-connect-providers \
  --query "OpenIDConnectProviderList[?contains(Arn, '$OIDC_ID')].Arn" \
  --output text)

if [ -z "$EXISTING_PROVIDER" ]; then
  # Fetch the SHA-1 thumbprint of the OIDC endpoint's TLS certificate.
  # (This reads the server certificate presented by the endpoint. If
  # eksctl is available, `eksctl utils associate-iam-oidc-provider
  # --cluster $CLUSTER_NAME --approve` is a simpler alternative.)
  THUMBPRINT=$(echo | openssl s_client -servername oidc.eks.$REGION.amazonaws.com \
    -connect oidc.eks.$REGION.amazonaws.com:443 2>/dev/null | \
    openssl x509 -fingerprint -sha1 -noout | \
    sed 's/://g' | \
    awk -F= '{print tolower($2)}')

  # Create the OIDC provider (IAM is global; no --region)
  aws iam create-open-id-connect-provider \
    --url $OIDC_ISSUER \
    --client-id-list sts.amazonaws.com \
    --thumbprint-list $THUMBPRINT

  echo "   ✓ OIDC provider created"
else
  echo "   ✓ OIDC provider already exists: $EXISTING_PROVIDER"
fi

# 6. Print cluster information
echo ""
echo "================================================"
echo "EKS cluster created!"
echo "================================================"
echo ""
echo "Cluster information:"
aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.{Name:name,Status:status,Version:version,Endpoint:endpoint,CreatedAt:createdAt}'

echo ""
echo "Control plane logs enabled:"
echo "  ✓ API Server"
echo "  ✓ Audit"
echo "  ✓ Authenticator"
echo "  ✓ Controller Manager"
echo "  ✓ Scheduler"
echo ""
echo "Endpoint access:"
echo "  ✓ Public access: enabled (IP-restricted)"
echo "  ✓ Private access: enabled"
echo ""
echo "Next step: create the node groups"
echo "  Run: ./create-node-groups.sh"
echo "================================================"

# Save cluster information
echo "export CLUSTER_NAME=$CLUSTER_NAME" >> eks-config.sh
echo "export OIDC_ISSUER=$OIDC_ISSUER" >> eks-config.sh
echo "export OIDC_ID=$OIDC_ID" >> eks-config.sh
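The OIDC ID extraction done in the script with `cut -d '/' -f 5` can also be written with pure parameter expansion, which avoids a subshell and can be exercised offline. A small sketch (the example issuer URL is illustrative but follows the real `https://oidc.eks.<region>.amazonaws.com/id/<ID>` shape):

```shell
# Sketch: take everything after the final "/" of an EKS OIDC issuer URL.
# Equivalent to `cut -d '/' -f 5` for URLs of the form
# https://oidc.eks.<region>.amazonaws.com/id/<ID>
oidc_id_from_issuer() {
  local issuer=$1
  echo "${issuer##*/}"   # strip the longest prefix ending in "/"
}

oidc_id_from_issuer "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
# -> EXAMPLED539D4633E53DE1B71EXAMPLE
```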

Managed Node Groups

Node Group Planning

Node group configuration:

1. General-purpose node group (On-Demand):
   ├─ Name: general-nodes
   ├─ Instance type: m5.xlarge
   ├─ Count: 6 (2 per AZ)
   ├─ Min: 6, max: 12
   ├─ Disk: 50GB gp3
   └─ Purpose: core business services

2. Spot instance node group:
   ├─ Name: spot-nodes
   ├─ Instance types: m5.xlarge, c5.xlarge
   ├─ Count: 3 (1 per AZ)
   ├─ Min: 3, max: 9
   ├─ Disk: 50GB gp3
   └─ Purpose: stateless services, batch jobs

3. Compute-optimized node group (optional):
   ├─ Name: compute-nodes
   ├─ Instance type: c5.2xlarge
   ├─ Count: 3
   ├─ Disk: 100GB gp3
   └─ Purpose: CPU-intensive workloads

Creating the General-Purpose Node Group

#!/bin/bash
# create-general-node-group.sh

source vpc-config.sh
source sg-config.sh
source eks-config.sh

REGION="us-east-1"
NODE_GROUP_NAME="general-nodes"

echo "================================================"
echo "Creating general-purpose node group: $NODE_GROUP_NAME"
echo "================================================"

# Create the node group
# (shorthand values such as minSize=6,maxSize=12 must stay on one line --
# a backslash continuation followed by indentation splits them into
# separate arguments and breaks parsing)
aws eks create-nodegroup \
  --cluster-name $CLUSTER_NAME \
  --nodegroup-name $NODE_GROUP_NAME \
  --region $REGION \
  --node-role $EKS_NODE_ROLE_ARN \
  --subnets $PRIVATE_APP_SUBNET_1A $PRIVATE_APP_SUBNET_1B $PRIVATE_APP_SUBNET_1C \
  --instance-types m5.xlarge \
  --scaling-config minSize=6,maxSize=12,desiredSize=6 \
  --disk-size 50 \
  --remote-access ec2SshKey=my-keypair,sourceSecurityGroups=$BASTION_SG_ID \
  --labels role=general,environment=production \
  --tags Environment=production,NodeGroup=general,ManagedBy=eks

echo "   Node group creation request submitted"

# Wait for the node group to become active
echo ""
echo "Waiting for the node group to become ACTIVE (expect 5-10 minutes)..."
aws eks wait nodegroup-active \
  --cluster-name $CLUSTER_NAME \
  --nodegroup-name $NODE_GROUP_NAME \
  --region $REGION

echo "   ✓ Node group is active"

# Verify the nodes
echo ""
echo "Verifying node status..."
kubectl get nodes \
  --label-columns=role,environment,node.kubernetes.io/instance-type

echo ""
echo "================================================"
echo "General-purpose node group created!"
echo "================================================"

Creating the Spot Node Group

#!/bin/bash
# create-spot-node-group.sh

source vpc-config.sh
source eks-config.sh

REGION="us-east-1"
NODE_GROUP_NAME="spot-nodes"

echo "================================================"
echo "Creating Spot instance node group: $NODE_GROUP_NAME"
echo "================================================"

# Create the node group
# (the EKS API expects taint effects as NO_SCHEDULE / NO_EXECUTE /
# PREFER_NO_SCHEDULE; Kubernetes then reports the taint as NoSchedule)
aws eks create-nodegroup \
  --cluster-name $CLUSTER_NAME \
  --nodegroup-name $NODE_GROUP_NAME \
  --region $REGION \
  --node-role $EKS_NODE_ROLE_ARN \
  --subnets $PRIVATE_APP_SUBNET_1A $PRIVATE_APP_SUBNET_1B $PRIVATE_APP_SUBNET_1C \
  --instance-types m5.xlarge c5.xlarge \
  --capacity-type SPOT \
  --scaling-config minSize=3,maxSize=9,desiredSize=3 \
  --disk-size 50 \
  --labels role=spot,environment=production,workload=stateless \
  --taints key=spot,value=true,effect=NO_SCHEDULE \
  --tags Environment=production,NodeGroup=spot,CapacityType=SPOT

echo "   Spot node group creation request submitted"

# Wait for the node group to become active
echo ""
echo "Waiting for the node group to become ACTIVE..."
aws eks wait nodegroup-active \
  --cluster-name $CLUSTER_NAME \
  --nodegroup-name $NODE_GROUP_NAME \
  --region $REGION

echo "   ✓ Spot node group is active"

# Verify the nodes
echo ""
echo "Verifying the Spot nodes..."
kubectl get nodes -l role=spot

echo ""
echo "================================================"
echo "Spot node group created!"
echo ""
echo "⚠️  Note:"
echo "  Spot nodes carry the taint spot=true:NoSchedule"
echo "  Pods need a matching toleration to schedule onto them"
echo ""
echo "Example toleration:"
echo "  tolerations:"
echo "  - key: spot"
echo "    value: \"true\""
echo "    effect: NoSchedule"
echo "================================================"
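A Pod lands on the Spot node group only if it both tolerates the taint and (typically) selects the `role=spot` label set on the group. A minimal Deployment sketch — the workload name and image are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker          # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        role: spot            # label applied to the Spot node group
      tolerations:
      - key: spot
        value: "true"
        effect: NoSchedule    # matches the node group taint
      containers:
      - name: worker
        image: nginx          # placeholder image
```

Without the nodeSelector the toleration merely permits Spot placement; adding it makes the workload run exclusively on Spot capacity.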

Installing Core Add-ons

VPC CNI Plugin

Purpose: Pod network management

#!/bin/bash
# install-vpc-cni.sh

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"

echo "Configuring the VPC CNI plugin..."

# Get the currently deployed version
CURRENT_VERSION=$(kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}' | \
  cut -d: -f2)

echo "   Current version: $CURRENT_VERSION"

# Target version -- check `aws eks describe-addon-versions --addon-name vpc-cni`
# for the versions compatible with your cluster's Kubernetes version
RECOMMENDED_VERSION="v1.15.1-eksbuild.1"

# Update the add-on
aws eks update-addon \
  --cluster-name $CLUSTER_NAME \
  --addon-name vpc-cni \
  --addon-version $RECOMMENDED_VERSION \
  --region $REGION \
  --resolve-conflicts OVERWRITE

echo "   ✓ VPC CNI add-on updated"

# Optional tuning via environment variables
# (note: changes made with `kubectl set env` can be reverted the next time
# the managed add-on is updated with --resolve-conflicts OVERWRITE; for
# durable settings, pass them via the add-on's --configuration-values)
kubectl set env daemonset aws-node \
  -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  ENABLE_POD_ENI=true \
  POD_SECURITY_GROUP_ENFORCING_MODE=standard \
  WARM_ENI_TARGET=1 \
  WARM_IP_TARGET=5

echo "   ✓ VPC CNI tuning applied"
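With ENABLE_PREFIX_DELEGATION=true, each secondary IP slot on a Nitro instance holds a /28 prefix (16 addresses) instead of a single address, which raises Pod density substantially. A rough capacity sketch — the 110/250 caps follow AWS's max-pods recommendation (110 for instances under 30 vCPUs, 250 otherwise) and should be verified against the official max-pods calculator script for your instance type:

```shell
# Sketch (assumption: prefix delegation assigns one /28 = 16 addresses
# per secondary IP slot on Nitro instances)
prefix_delegation_capacity() {
  local enis=$1 ips_per_eni=$2
  # every slot except each ENI's primary IP can hold a /28 prefix
  echo $(( enis * (ips_per_eni - 1) * 16 ))
}

# Assumption: EKS recommends capping max-pods at 110 (<30 vCPUs) or 250
recommended_max_pods() {
  local capacity=$1 vcpus=$2
  local cap=110
  [ "$vcpus" -ge 30 ] && cap=250
  if [ "$capacity" -lt "$cap" ]; then
    echo "$capacity"
  else
    echo "$cap"
  fi
}

prefix_delegation_capacity 4 15   # m5.xlarge -> 896 addresses
recommended_max_pods 896 4        # 4 vCPUs   -> 110 (capped)
```

In practice the cap, not raw address capacity, usually bounds Pod density once prefix delegation is on.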

CoreDNS

#!/bin/bash
# install-coredns.sh

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"

echo "Configuring CoreDNS..."

# Update the CoreDNS add-on
aws eks update-addon \
  --cluster-name $CLUSTER_NAME \
  --addon-name coredns \
  --region $REGION \
  --resolve-conflicts OVERWRITE

echo "   ✓ CoreDNS updated"

# Check the CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

EBS CSI Driver

#!/bin/bash
# install-ebs-csi-driver.sh

source eks-config.sh   # provides OIDC_ID (saved by create-eks-cluster.sh)

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

echo "Installing the EBS CSI driver..."

# 1. Create the IRSA role
ROLE_NAME="AmazonEKS_EBS_CSI_DriverRole"

cat > ebs-csi-trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
          "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:ebs-csi-controller-sa"
        }
      }
    }
  ]
}
EOF

aws iam create-role \
  --role-name $ROLE_NAME \
  --assume-role-policy-document file://ebs-csi-trust-policy.json

aws iam attach-role-policy \
  --role-name $ROLE_NAME \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy

EBS_CSI_ROLE_ARN=$(aws iam get-role --role-name $ROLE_NAME --query 'Role.Arn' --output text)

# 2. Install the add-on
aws eks create-addon \
  --cluster-name $CLUSTER_NAME \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn $EBS_CSI_ROLE_ARN \
  --region $REGION

echo "   ✓ EBS CSI driver installed"

# 3. Create a default gp3 StorageClass
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
allowVolumeExpansion: true
EOF

echo "   ✓ gp3 StorageClass created (default)"

rm -f ebs-csi-trust-policy.json
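Once the gp3 StorageClass is the default, workloads request EBS volumes through ordinary PersistentVolumeClaims; with WaitForFirstConsumer, the volume is only provisioned — in the right Availability Zone — when the first Pod using the claim is scheduled. A minimal sketch (claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim            # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce           # an EBS volume attaches to one node at a time
  storageClassName: gp3       # the default class created above
  resources:
    requests:
      storage: 20Gi
```

Because `allowVolumeExpansion: true` is set on the class, the claim's requested size can later be increased in place.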

AWS Load Balancer Controller

Installation script

#!/bin/bash
# install-aws-load-balancer-controller.sh
# Prerequisites: eksctl and helm must be installed

source vpc-config.sh   # provides VPC_ID

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

echo "================================================"
echo "Installing the AWS Load Balancer Controller"
echo "================================================"

# 1. Create the IAM policy
echo ""
echo "1. Creating the IAM policy..."

curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.6.2/docs/install/iam_policy.json

aws iam create-policy \
  --policy-name AWSLoadBalancerControllerIAMPolicy \
  --policy-document file://iam-policy.json

POLICY_ARN="arn:aws:iam::${ACCOUNT_ID}:policy/AWSLoadBalancerControllerIAMPolicy"
echo "   Policy ARN: $POLICY_ARN"

# 2. Create the IRSA role and service account
echo ""
echo "2. Creating the service account and IAM role..."

eksctl create iamserviceaccount \
  --cluster=$CLUSTER_NAME \
  --namespace=kube-system \
  --name=aws-load-balancer-controller \
  --attach-policy-arn=$POLICY_ARN \
  --approve \
  --region=$REGION

# 3. Add the Helm repository
echo ""
echo "3. Adding the Helm repository..."

helm repo add eks https://aws.github.io/eks-charts
helm repo update

# 4. Install the controller
echo ""
echo "4. Installing the AWS Load Balancer Controller..."

helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=$CLUSTER_NAME \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller \
  --set region=$REGION \
  --set vpcId=$VPC_ID

echo ""
echo "5. Verifying the installation..."
kubectl get deployment -n kube-system aws-load-balancer-controller

echo ""
echo "================================================"
echo "AWS Load Balancer Controller installed!"
echo "================================================"

rm -f iam-policy.json

CloudWatch Container Insights

Installing Fluent Bit

#!/bin/bash
# install-fluent-bit.sh

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"

echo "Installing Fluent Bit (log collection)..."

# Download the quickstart manifest, substitute the cluster name and
# region placeholders, and apply it
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | \
sed "s/{{cluster_name}}/$CLUSTER_NAME/;s/{{region_name}}/$REGION/" | \
kubectl apply -f -

echo "   ✓ Fluent Bit installed"

# Verify
kubectl get daemonset fluent-bit -n amazon-cloudwatch

Enabling Container Insights

#!/bin/bash
# enable-container-insights.sh

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"

echo "Enabling Container Insights..."

# If control-plane logging was already enabled at cluster creation, skip
# this call (re-submitting an identical logging config is rejected as a
# no-op). Container Insights itself comes from the CloudWatch Agent
# installed below.
aws eks update-cluster-config \
  --name $CLUSTER_NAME \
  --region $REGION \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

# Install the CloudWatch Agent
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/cloudwatch-namespace.yaml

kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/cwagent/cwagent-serviceaccount.yaml

curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/cwagent/cwagent-configmap.yaml | \
sed "s/{{cluster_name}}/$CLUSTER_NAME/" | \
kubectl apply -f -

kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/cwagent/cwagent-daemonset.yaml

echo "   ✓ Container Insights enabled"
echo ""
echo "View logs and metrics: AWS Console → CloudWatch → Container Insights"

Cluster Verification

Full verification script

#!/bin/bash
# verify-eks-cluster.sh

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"

echo "================================================"
echo "EKS cluster verification"
echo "================================================"

# 1. Cluster status
echo ""
echo "1. Cluster status"
aws eks describe-cluster \
  --name $CLUSTER_NAME \
  --region $REGION \
  --query 'cluster.{Name:name,Status:status,Version:version,PlatformVersion:platformVersion}'

# 2. Node status
echo ""
echo "2. Node status"
kubectl get nodes -o wide

# 3. System pods
echo ""
echo "3. System component status"
kubectl get pods -n kube-system

# 4. Storage classes
echo ""
echo "4. Storage classes"
kubectl get storageclass

# 5. Ingress classes
echo ""
echo "5. Ingress classes"
kubectl get ingressclass

# 6. Node resource usage (requires metrics-server)
echo ""
echo "6. Node resource usage"
kubectl top nodes

# 7. Deploy a test application
echo ""
echo "7. Deploying a test application"
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer

echo "   Waiting for the LoadBalancer..."
sleep 30

kubectl get svc nginx

echo ""
echo "================================================"
echo "Verification complete!"
echo ""
echo "Clean up the test resources:"
echo "  kubectl delete svc nginx"
echo "  kubectl delete deployment nginx"
echo "================================================"
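The node check above can be made scriptable by parsing the STATUS column of `kubectl get nodes --no-headers`, which also allows exercising the logic offline with captured output. A sketch:

```shell
# Sketch: count nodes whose STATUS (second column of
# `kubectl get nodes --no-headers`) is anything other than "Ready".
# A non-zero count can be used to fail a CI verification step.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n+0 }'
}

# Offline example with captured output (node names are illustrative):
sample='ip-10-0-1-10.ec2.internal   Ready      <none>   5m   v1.28.3
ip-10-0-2-20.ec2.internal   Ready      <none>   5m   v1.28.3
ip-10-0-3-30.ec2.internal   NotReady   <none>   1m   v1.28.3'
printf '%s\n' "$sample" | count_not_ready   # -> 1
```

Against a live cluster: `kubectl get nodes --no-headers | count_not_ready`.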

Best-Practice Summary

1. Cluster configuration

✓ Enable all control plane log types
✓ Use hybrid endpoint access (public + private)
✓ Restrict public endpoint access by IP
✓ Enable the OIDC provider (IRSA)
✓ Keep the Kubernetes version current

2. Node configuration

✓ Deploy across multiple Availability Zones
✓ Use Managed Node Groups
✓ Mix On-Demand and Spot capacity
✓ Size disks and choose volume types appropriately
✓ Apply node labels and taints

3. Network configuration

✓ Run nodes in private subnets
✓ One NAT Gateway per AZ
✓ Use VPC Endpoints to reduce cost
✓ Plan IP address space carefully
✓ Use security group references

4. Security configuration

✓ Prefer IRSA over node IAM roles
✓ Enable encryption (etcd secrets, EBS)
✓ Rotate credentials regularly
✓ Apply Pod Security Standards
✓ Enable audit logging

5. Monitoring configuration

✓ Enable Container Insights
✓ Configure log collection
✓ Define alarm rules
✓ Monitor cost
✓ Review metrics regularly

Next: continue with the Application Deployment and Load Balancing chapter.