Cost Optimization and Best Practices

This chapter covers how to optimize AWS costs, including cost analysis, Spot instance strategy, Reserved Instances, resource right-sizing, and FinOps best practices.

Cost Analysis Architecture

Cost Composition

Total cost
├─ Compute (40-50%)
│  ├─ EKS control plane: $0.10/hour/cluster
│  ├─ EC2 instances (worker nodes)
│  │  ├─ On-Demand: list price
│  │  ├─ Reserved Instances: save 40-60%
│  │  └─ Spot Instances: save 70-90%
│  └─ NAT Gateway: $0.045/hour + data processing
│
├─ Storage (15-20%)
│  ├─ EBS volumes
│  │  ├─ gp3: $0.08/GB/month
│  │  ├─ gp2: $0.10/GB/month
│  │  └─ io2: $0.125/GB/month + IOPS charges
│  ├─ S3
│  │  ├─ Standard: $0.023/GB/month
│  │  ├─ IA: $0.0125/GB/month
│  │  └─ Glacier: $0.004/GB/month
│  └─ EFS: $0.30/GB/month
│
├─ Database (20-30%)
│  ├─ RDS: instances + storage + backups
│  ├─ ElastiCache: node charges
│  └─ DynamoDB: on-demand or provisioned capacity
│
├─ Network (5-10%)
│  ├─ Data transfer out of AWS
│  ├─ Cross-AZ data transfer
│  ├─ Cross-region data transfer
│  └─ NAT Gateway data processing
│
└─ Other (5-10%)
   ├─ CloudWatch Logs
   ├─ ALB/NLB
   ├─ Route 53
   └─ KMS
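As a worked example of the breakdown above, the sketch below splits a hypothetical $10,000 monthly bill across the five categories using assumed midpoint percentages. The dollar figures are illustrative only, not a real bill:

```shell
#!/bin/bash
# cost-breakdown-example.sh -- hypothetical split of an assumed
# $10,000/month bill using midpoints of the percentage bands above.

cost_share() {  # usage: cost_share TOTAL PERCENT
  awk -v t="$1" -v p="$2" 'BEGIN { printf "%.2f", t * p / 100 }'
}

TOTAL=10000   # assumed monthly bill in USD
echo "Compute  (45%): \$$(cost_share $TOTAL 45)"
echo "Storage  (17%): \$$(cost_share $TOTAL 17)"
echo "Database (25%): \$$(cost_share $TOTAL 25)"
echo "Network  (8%):  \$$(cost_share $TOTAL 8)"
echo "Other    (5%):  \$$(cost_share $TOTAL 5)"
```

Running the same function against your own bill total shows where a 20% saving in one category matters more than a 50% saving in another.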

Cost Visualization

┌─────────────────────────────────────────────────────┐
│              AWS Cost Explorer                       │
│  ┌──────────────────────────────────────────────┐  │
│  │  Group by service                             │  │
│  │  Group by tag (environment, team, project)    │  │
│  │  Trend over time                              │  │
│  └──────────────────────────────────────────────┘  │
└───────────────────┬─────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│              AWS Cost Anomaly Detection              │
│  ┌──────────────────────────────────────────────┐  │
│  │  Automatically detect unusual spend           │  │
│  │  Send alert notifications                     │  │
│  └──────────────────────────────────────────────┘  │
└───────────────────┬─────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────────────────┐
│              AWS Budgets                             │
│  ┌──────────────────────────────────────────────┐  │
│  │  Set budget thresholds                        │  │
│  │  Over-budget alerts                           │  │
│  │  Forecast-based alerts                        │  │
│  └──────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Tag Strategy

Cost Allocation Tags

Tag schema:

# Required tags
Environment: production | staging | development
Team: platform | data | backend | frontend
Project: user-service | order-service | payment-service
CostCenter: engineering | product | marketing
Owner: team-email@company.com

# Optional tags
Application: api-gateway | database | cache
Component: compute | storage | network
ManagedBy: terraform | helm | manual
Backup: enabled | disabled
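Before applying tags, it helps to be able to check a resource's tags against the required set. A minimal pure-bash sketch; the `Key=Value` input format and the `missing_tags` helper are illustrative, not an AWS API:

```shell
#!/bin/bash
# validate-tags.sh -- sketch: report which required cost-allocation
# tags are missing from a "Key=Value Key=Value ..." tag list.

REQUIRED_TAGS="Environment Team Project CostCenter Owner"

missing_tags() {  # usage: missing_tags "Key=Val Key=Val ..."
  local have="$1" missing=""
  for key in $REQUIRED_TAGS; do
    case " $have" in
      *" $key="*) ;;                 # tag present
      *) missing="$missing $key" ;;  # tag absent
    esac
  done
  echo "${missing# }"
}

# Example: a resource tagged with only two of the five required tags
missing_tags "Environment=production Team=platform"
```

A check like this can gate CI pipelines so untagged infrastructure never reaches production.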

Script to apply tags:

#!/bin/bash
# tag-resources.sh

REGION="us-east-1"

echo "================================================"
echo "Adding cost allocation tags to resources"
echo "================================================"

# 1. Tag EC2 instances
echo ""
echo "1. Tagging EC2 instances..."
aws ec2 describe-instances \
  --filters "Name=tag:kubernetes.io/cluster/production-eks-cluster,Values=owned" \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text | \
xargs -I {} aws ec2 create-tags \
  --resources {} \
  --tags \
    Key=Environment,Value=production \
    Key=Team,Value=platform \
    Key=CostCenter,Value=engineering \
  --region $REGION

# 2. Tag EBS volumes
echo ""
echo "2. Tagging EBS volumes..."
aws ec2 describe-volumes \
  --filters "Name=tag:kubernetes.io/cluster/production-eks-cluster,Values=owned" \
  --query 'Volumes[].VolumeId' \
  --output text | \
xargs -I {} aws ec2 create-tags \
  --resources {} \
  --tags \
    Key=Environment,Value=production \
    Key=Team,Value=platform \
    Key=CostCenter,Value=engineering \
  --region $REGION

# 3. Tag RDS instances
echo ""
echo "3. Tagging RDS instances..."
# Show the existing tags first
aws rds list-tags-for-resource \
  --resource-name arn:aws:rds:us-east-1:123456789012:db:production-postgres-users \
  --region $REGION

aws rds add-tags-to-resource \
  --resource-name arn:aws:rds:us-east-1:123456789012:db:production-postgres-users \
  --tags \
    Key=Environment,Value=production \
    Key=Team,Value=backend \
    Key=CostCenter,Value=engineering \
  --region $REGION

# 4. Tag S3 buckets
echo ""
echo "4. Tagging S3 buckets..."
aws s3api put-bucket-tagging \
  --bucket production-app-assets-123456789012 \
  --tagging 'TagSet=[
    {Key=Environment,Value=production},
    {Key=Team,Value=platform},
    {Key=CostCenter,Value=engineering}
  ]'

echo ""
echo "================================================"
echo "Tagging complete!"
echo "================================================"

Kubernetes resource labels:

# deployment-with-cost-tags.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
  namespace: production
  labels:
    app: user-service
    environment: production
    team: backend
    cost-center: engineering
    project: user-management
spec:
  template:
    metadata:
      labels:
        app: user-service
        environment: production
        team: backend
        cost-center: engineering

Compute Cost Optimization

Spot Instance Strategy

Spot node group configuration:

#!/bin/bash
# create-spot-node-group-advanced.sh

CLUSTER_NAME="production-eks-cluster"
REGION="us-east-1"

echo "Creating multi-instance-type Spot node group..."

# Instance type strategy:
# - Use multiple instance types to improve availability
# - Choose instance types with similar specs
# - Use the capacity-optimized allocation strategy

aws eks create-nodegroup \
  --cluster-name $CLUSTER_NAME \
  --nodegroup-name spot-mixed-nodes \
  --node-role $EKS_NODE_ROLE_ARN \
  --subnets $PRIVATE_APP_SUBNET_1A $PRIVATE_APP_SUBNET_1B $PRIVATE_APP_SUBNET_1C \
  --instance-types m5.xlarge m5a.xlarge m5n.xlarge c5.xlarge c5a.xlarge \
  --capacity-type SPOT \
  --scaling-config minSize=3,maxSize=30,desiredSize=6 \
  --update-config maxUnavailable=1 \
  --labels workload-type=stateless,capacity-type=spot,environment=production \
  --taints key=spot,value=true,effect=NO_SCHEDULE \
  --tags Environment=production,CapacityType=SPOT,CostOptimized=true \
  --region $REGION

echo "✓ Spot node group created"
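To put a number on the Spot discount, here is a small estimator under stated assumptions: an m5.xlarge On-Demand rate of about $0.192/hour (us-east-1, indicative only) and a 720-hour month. Actual Spot prices vary by capacity pool and over time:

```shell
#!/bin/bash
# spot-savings-estimate.sh -- sketch: monthly cost of a node group at
# a given average Spot discount versus On-Demand.

spot_monthly_cost() {  # usage: spot_monthly_cost NODES OD_HOURLY DISCOUNT_PCT
  awk -v n="$1" -v od="$2" -v d="$3" \
    'BEGIN { printf "%.2f", n * od * 720 * (1 - d / 100) }'
}

# 6 nodes, assumed $0.192/hr On-Demand, ~70% average Spot discount
echo "On-Demand: \$$(spot_monthly_cost 6 0.192 0)/month"
echo "Spot:      \$$(spot_monthly_cost 6 0.192 70)/month"
```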

Spot interruption handling:

# spot-interrupt-handler.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler
    spec:
      serviceAccountName: aws-node-termination-handler
      hostNetwork: true
      containers:
      - name: aws-node-termination-handler
        image: public.ecr.aws/aws-ec2/aws-node-termination-handler:v1.19.0
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: ENABLE_SPOT_INTERRUPTION_DRAINING
          value: "true"
        - name: ENABLE_SCHEDULED_EVENT_DRAINING
          value: "true"
        - name: DELETE_LOCAL_DATA
          value: "true"
        - name: IGNORE_DAEMON_SETS
          value: "true"
        - name: POD_TERMINATION_GRACE_PERIOD
          value: "30"
        - name: WEBHOOK_URL
          value: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

Workloads suited to Spot instances:

# stateless-deployment-spot.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      # Schedule onto Spot nodes
      tolerations:
      - key: spot
        value: "true"
        effect: NoSchedule
      
      nodeSelector:
        capacity-type: spot
      
      # Graceful shutdown
      terminationGracePeriodSeconds: 120
      
      containers:
      - name: processor
        image: batch-processor:v1.0.0
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]

Reserved Instances Strategy

RI purchase analysis:

#!/bin/bash
# analyze-ri-opportunities.sh

REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

echo "================================================"
echo "Analyzing Reserved Instance purchase opportunities"
echo "================================================"

# 1. Instance usage over the past 30 days
echo ""
echo "1. Analyzing instance usage patterns..."
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity DAILY \
  --metrics UnblendedCost UsageQuantity \
  --group-by Type=DIMENSION,Key=INSTANCE_TYPE \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["Amazon Elastic Compute Cloud - Compute"]
    }
  }' \
  --region $REGION

# 2. Get RI recommendations
echo ""
echo "2. Fetching AWS RI purchase recommendations..."
aws ce get-reservation-purchase-recommendation \
  --service "Amazon Elastic Compute Cloud - Compute" \
  --lookback-period-in-days SIXTY_DAYS \
  --term-in-years ONE_YEAR \
  --payment-option PARTIAL_UPFRONT \
  --region $REGION

# 3. Current RI utilization
echo ""
echo "3. Current RI utilization..."
aws ce get-reservation-utilization \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --region $REGION

echo ""
echo "================================================"
echo "Recommendations:"
echo "  1. Use RIs for node groups with steady usage"
echo "  2. 1-year Partial Upfront is a good default"
echo "  3. Review RI utilization regularly"
echo "================================================"

RI purchase script:

#!/bin/bash
# purchase-reserved-instances.sh

REGION="us-east-1"

echo "Purchasing Reserved Instances..."

# Example: buy 3 m5.xlarge instances, 1-year term, Partial Upfront
aws ec2 purchase-reserved-instances-offering \
  --reserved-instances-offering-id <OFFERING_ID> \
  --instance-count 3 \
  --region $REGION

echo "✓ RI purchase complete"

Savings Plans

Savings Plans analysis:

#!/bin/bash
# analyze-savings-plans.sh

REGION="us-east-1"

echo "Analyzing Savings Plans opportunities..."

# Get Savings Plans recommendations
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option PARTIAL_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS \
  --region $REGION

echo ""
echo "Compute Savings Plans vs EC2 Reserved Instances:"
echo "  Compute SP: more flexible, spans instance types, regions, and OS"
echo "  EC2 RI: higher discount, less flexibility"
echo ""
echo "Recommended strategy:"
echo "  1. Base capacity: EC2 RI (maximum discount)"
echo "  2. Elastic capacity: Compute SP (flexibility)"
echo "  3. Peak capacity: On-Demand + Spot"
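The layered strategy above can be priced as a blend. A sketch with assumed round-number discounts (RI 40%, Compute SP 30%, Spot 70%); real discounts depend on term, payment option, and instance family:

```shell
#!/bin/bash
# blended-capacity-cost.sh -- sketch: price a fleet where base capacity
# is covered by RI, elastic capacity by Compute SP, and peaks by Spot.
# The discount rates inside the function are assumptions.

blended_cost() {  # usage: blended_cost OD_TOTAL RI_SHARE SP_SHARE SPOT_SHARE
  awk -v od="$1" -v ri="$2" -v sp="$3" -v spot="$4" 'BEGIN {
    odshare = 1 - ri - sp - spot
    # assumed discounts: RI 40%, Compute SP 30%, Spot 70%
    printf "%.2f", od * (ri*0.6 + sp*0.7 + spot*0.3 + odshare)
  }'
}

# $10,000/month On-Demand equivalent: 50% RI, 20% SP, 20% Spot, 10% OD
blended_cost 10000 0.5 0.2 0.2
```

Shifting the shares shows why over-committing to RIs is risky: capacity you no longer run still gets billed.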

Storage Cost Optimization

EBS Optimization

EBS volume analysis:

#!/bin/bash
# analyze-ebs-volumes.sh

REGION="us-east-1"

echo "================================================"
echo "Analyzing EBS volume optimization opportunities"
echo "================================================"

# 1. Find unattached volumes
echo ""
echo "1. Unattached EBS volumes:"
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,Size:Size,Type:VolumeType,CreateTime:CreateTime}' \
  --output table \
  --region $REGION

# 2. Find low-utilization volumes
echo ""
echo "2. Volumes averaging < 100 IOPS over the past 7 days:"
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-xxxxx \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average \
  --region $REGION

# 3. Find gp2 volumes (candidates for gp3)
echo ""
echo "3. gp2 volumes that can be migrated to gp3:"
aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[].{ID:VolumeId,Size:Size,State:State}' \
  --output table \
  --region $REGION

echo ""
echo "================================================"
echo "Optimization recommendations:"
echo "  1. Delete unattached volumes (save 100%)"
echo "  2. gp2 -> gp3 (save 20%, better performance)"
echo "  3. Shrink over-provisioned volumes"
echo "================================================"

Migrating to gp3:

#!/bin/bash
# migrate-gp2-to-gp3.sh

REGION="us-east-1"

echo "Migrating gp2 volumes to gp3..."

# Get all gp2 volumes
GP2_VOLUMES=$(aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[].VolumeId' \
  --output text \
  --region $REGION)

for VOLUME_ID in $GP2_VOLUMES; do
  echo "Migrating volume: $VOLUME_ID"
  
  aws ec2 modify-volume \
    --volume-id $VOLUME_ID \
    --volume-type gp3 \
    --iops 3000 \
    --throughput 125 \
    --region $REGION
  
  echo "  ✓ Modification request submitted"
done

echo ""
echo "Note: a volume type change can take minutes to hours to complete"
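Using the list prices from the cost breakdown ($0.10/GB/month for gp2, $0.08/GB/month for gp3 with baseline 3000 IOPS and 125 MiB/s included), the per-month saving from a migration is easy to estimate:

```shell
#!/bin/bash
# ebs-gp3-savings.sh -- sketch: monthly savings from moving gp2
# volumes to gp3 at the list prices quoted earlier in this chapter.

gp3_monthly_savings() {  # usage: gp3_monthly_savings TOTAL_GB
  awk -v gb="$1" 'BEGIN { printf "%.2f", gb * (0.10 - 0.08) }'
}

# 5 TB of gp2 volumes migrated to gp3
gp3_monthly_savings 5000
```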

S3 Lifecycle Policies

Intelligent tiering strategy:

#!/bin/bash
# configure-s3-lifecycle.sh

BUCKET="production-app-assets-123456789012"

echo "Configuring S3 lifecycle policy..."

cat > lifecycle-policy.json << 'EOF'
{
  "Rules": [
    {
      "Id": "IntelligentTiering",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "assets/"
      },
      "Transitions": [
        {
          "Days": 0,
          "StorageClass": "INTELLIGENT_TIERING"
        }
      ]
    },
    {
      "Id": "ArchiveOldLogs",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 90,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    },
    {
      "Id": "DeleteOldVersions",
      "Status": "Enabled",
      "NoncurrentVersionTransitions": [
        {
          "NoncurrentDays": 30,
          "StorageClass": "GLACIER"
        }
      ],
      "NoncurrentVersionExpiration": {
        "NoncurrentDays": 90
      }
    },
    {
      "Id": "DeleteIncompleteMultipart",
      "Status": "Enabled",
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket $BUCKET \
  --lifecycle-configuration file://lifecycle-policy.json

echo "✓ Lifecycle policy configured"

rm -f lifecycle-policy.json
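Using the per-tier prices from the cost breakdown, here is a quick estimate of what a batch of logs costs per month in each lifecycle stage (the 100 GB volume is an assumed figure):

```shell
#!/bin/bash
# s3-lifecycle-cost.sh -- sketch: monthly storage cost of a log batch
# as it moves through the lifecycle rules above.

tier_cost() {  # usage: tier_cost GB PRICE_PER_GB_MONTH
  awk -v gb="$1" -v p="$2" 'BEGIN { printf "%.4f", gb * p }'
}

GB=100   # assumed log volume
echo "Days 0-30  (Standard): \$$(tier_cost $GB 0.023)/month"
echo "Days 30-90 (Glacier):  \$$(tier_cost $GB 0.004)/month"
```

The spread between tiers is why expiring logs at day 365 matters less than transitioning them early.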

Network Cost Optimization

NAT Gateway Optimization

Analyzing NAT Gateway cost:

#!/bin/bash
# analyze-nat-gateway-cost.sh

REGION="us-east-1"

echo "================================================"
echo "Analyzing NAT Gateway cost"
echo "================================================"

# 1. NAT Gateway data volume
echo ""
echo "1. Data processed over the past 30 days:"
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-xxxxx \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Sum \
  --region $REGION

# 2. Cost estimate
echo ""
echo "2. Cost estimate:"
echo "  NAT Gateway hourly charge: \$0.045/hour = ~\$32.40/month"
echo "  Data processing: \$0.045/GB"
echo ""
echo "Example: processing 1 TB of data"
echo "  Fixed charge: \$32.40"
echo "  Data charge: 1000 GB x \$0.045 = \$45"
echo "  Total: ~\$77.40/month"

echo ""
echo "================================================"
echo "Optimization recommendations:"
echo "  1. Use VPC Endpoints to reach AWS services"
echo "  2. Cache external API responses"
echo "  3. Use the S3 Gateway Endpoint"
echo "  4. Consider a single NAT Gateway (non-production)"
echo "================================================"
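The estimate printed above, as a reusable function (a 720-hour month is assumed):

```shell
#!/bin/bash
# nat-cost-estimate.sh -- sketch: monthly NAT Gateway cost =
# fixed hourly charge per gateway + per-GB data processing.

nat_monthly_cost() {  # usage: nat_monthly_cost PROCESSED_GB [GATEWAYS]
  awk -v gb="$1" -v n="${2:-1}" \
    'BEGIN { printf "%.2f", n * 0.045 * 720 + gb * 0.045 }'
}

nat_monthly_cost 1000      # 1 NAT gateway, 1 TB processed
nat_monthly_cost 1000 3    # 3 gateways (one per AZ)
```

Note how the fixed charge triples with one gateway per AZ while the data charge stays flat, which is why single-gateway setups appeal for non-production.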

Configuring VPC Endpoints:

#!/bin/bash
# create-vpc-endpoints.sh

source vpc-config.sh
source sg-config.sh

REGION="us-east-1"

echo "Creating VPC Endpoints (reduces NAT Gateway cost)..."

# 1. S3 Gateway Endpoint (free)
echo ""
echo "1. Creating S3 Gateway Endpoint..."
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --service-name com.amazonaws.$REGION.s3 \
  --route-table-ids $PRIVATE_ROUTE_TABLE_ID \
  --region $REGION

# 2. ECR API Endpoint
echo ""
echo "2. Creating ECR API Endpoint..."
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.$REGION.ecr.api \
  --subnet-ids $PRIVATE_APP_SUBNET_1A $PRIVATE_APP_SUBNET_1B $PRIVATE_APP_SUBNET_1C \
  --security-group-ids $VPCE_SG_ID \
  --region $REGION

# 3. ECR DKR Endpoint
echo ""
echo "3. Creating ECR DKR Endpoint..."
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.$REGION.ecr.dkr \
  --subnet-ids $PRIVATE_APP_SUBNET_1A $PRIVATE_APP_SUBNET_1B $PRIVATE_APP_SUBNET_1C \
  --security-group-ids $VPCE_SG_ID \
  --region $REGION

# 4. CloudWatch Logs Endpoint
echo ""
echo "4. Creating CloudWatch Logs Endpoint..."
aws ec2 create-vpc-endpoint \
  --vpc-id $VPC_ID \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.$REGION.logs \
  --subnet-ids $PRIVATE_APP_SUBNET_1A $PRIVATE_APP_SUBNET_1B $PRIVATE_APP_SUBNET_1C \
  --security-group-ids $VPCE_SG_ID \
  --region $REGION

echo ""
echo "✓ VPC Endpoints created"
echo ""
echo "Expected monthly impact:"
echo "  - Lower NAT Gateway data processing charges"
echo "  - Interface Endpoint charge: \$0.01/hour/AZ = ~\$7.20/month per AZ"
echo "  - Endpoint data processing: ~\$0.01/GB (vs \$0.045/GB through NAT)"
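A useful sanity check is the break-even point: how many GB per month must flow through an Interface Endpoint before its hourly charge is offset by cheaper data processing. The $0.01/GB endpoint processing rate and $0.045/GB NAT rate are assumptions from the current published price list:

```shell
#!/bin/bash
# vpce-breakeven.sh -- sketch: monthly GB at which an Interface
# Endpoint (~$0.01/hr/AZ) pays for itself versus NAT data processing.

vpce_breakeven_gb() {  # usage: vpce_breakeven_gb AZ_COUNT
  awk -v az="$1" 'BEGIN { printf "%.0f", az * 0.01 * 720 / (0.045 - 0.01) }'
}

vpce_breakeven_gb 3   # endpoint spanning 3 AZs
```

Below that traffic level, a low-volume service may not justify its own Interface Endpoint.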

Cross-AZ Data Transfer Optimization

Topology-aware routing:

# service-topology-aware.yaml
apiVersion: v1
kind: Service
metadata:
  name: user-service
  namespace: production
  annotations:
    # Topology Aware Hints keep traffic in-zone where possible,
    # reducing cross-AZ transfer (this annotation replaces the
    # topologyKeys field, which was removed in Kubernetes 1.22)
    service.kubernetes.io/topology-aware-hints: auto
spec:
  selector:
    app: user-service
  ports:
  - port: 9001
  type: ClusterIP

Database Cost Optimization

RDS Cost Optimization

RDS instance right-sizing:

#!/bin/bash
# analyze-rds-utilization.sh

DB_INSTANCE="production-postgres-users"
REGION="us-east-1"

echo "Analyzing RDS instance utilization..."

# CPU utilization (past 30 days)
echo ""
echo "CPU utilization (30-day average):"
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=$DB_INSTANCE \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 86400 \
  --statistics Average \
  --region $REGION

# Connection count
echo ""
echo "Database connections:"
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name DatabaseConnections \
  --dimensions Name=DBInstanceIdentifier,Value=$DB_INSTANCE \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Average,Maximum \
  --region $REGION

echo ""
echo "Optimization guidance:"
echo "  - CPU < 40%: consider a smaller instance class"
echo "  - CPU 40-70%: current sizing is reasonable"
echo "  - CPU > 70%: consider scaling up or optimizing queries"

Aurora Serverless v2 migration:

#!/bin/bash
# migrate-to-aurora-serverless.sh

echo "================================================"
echo "Aurora Serverless v2 cost comparison"
echo "================================================"

echo ""
echo "Provisioned RDS (db.r6g.xlarge):"
echo "  Fixed cost: \$0.42/hour = ~\$302.40/month"
echo ""
echo "Aurora Serverless v2:"
echo "  Pay per use: \$0.12/ACU/hour"
echo "  Minimum capacity: 0.5 ACU"
echo "  Maximum capacity: scales automatically with load"
echo ""
echo "Example cost calculation:"
echo "  Average 2 ACU x 720 hours x \$0.12 = \$172.80/month"
echo "  Savings: \$302.40 - \$172.80 = \$129.60/month (43%)"
echo ""
echo "Good fit for:"
echo "  ✓ Intermittent workloads"
echo "  ✓ Dev/test environments"
echo "  ✓ Unpredictable traffic patterns"
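The comparison above implies a break-even point: the average ACU count at which Serverless v2 costs the same as the provisioned instance. Below it, Serverless v2 wins:

```shell
#!/bin/bash
# aurora-breakeven.sh -- sketch: average ACU count at which Aurora
# Serverless v2 ($0.12/ACU-hr) matches a provisioned instance's
# hourly rate, using the figures quoted above.

breakeven_acu() {  # usage: breakeven_acu PROVISIONED_HOURLY
  awk -v p="$1" 'BEGIN { printf "%.2f", p / 0.12 }'
}

# db.r6g.xlarge at $0.42/hr: below this average ACU, Serverless v2 is cheaper
breakeven_acu 0.42
```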

DynamoDB Cost Optimization

On-demand vs provisioned capacity:

#!/bin/bash
# analyze-dynamodb-cost.sh

TABLE_NAME="production-sessions"

echo "Analyzing DynamoDB cost..."

# Read/write requests over the past 7 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedReadCapacityUnits \
  --dimensions Name=TableName,Value=$TABLE_NAME \
  --start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600 \
  --statistics Sum \
  --region us-east-1

echo ""
echo "Cost comparison:"
echo ""
echo "On-demand mode:"
echo "  Reads: \$1.25 per million request units"
echo "  Writes: \$6.25 per million request units"
echo ""
echo "Provisioned capacity:"
echo "  Reads: \$0.00013/RCU/hour = ~\$0.0936/RCU/month"
echo "  Writes: \$0.00065/WCU/hour = ~\$0.468/WCU/month"
echo ""
echo "When to switch:"
echo "  - Unpredictable traffic -> on-demand"
echo "  - Steady traffic -> provisioned + Auto Scaling"
echo "  - Provisioned capacity sitting far above actual usage -> on-demand"
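The prices above make the on-demand vs provisioned trade-off easy to quantify. A sketch comparing monthly read costs; the request volumes and RCU counts are illustrative:

```shell
#!/bin/bash
# dynamodb-breakeven.sh -- sketch: compare monthly read cost of
# on-demand requests versus provisioned RCUs at the prices above.

ondemand_read_cost() {  # usage: ondemand_read_cost MILLION_READS
  awk -v m="$1" 'BEGIN { printf "%.2f", m * 1.25 }'
}

provisioned_read_cost() {  # usage: provisioned_read_cost RCU
  awk -v r="$1" 'BEGIN { printf "%.2f", r * 0.00013 * 720 }'
}

ondemand_read_cost 50        # 50M reads/month on-demand
provisioned_read_cost 100    # 100 RCU provisioned
```

If 100 well-utilized RCUs serve those 50M reads, provisioned capacity is several times cheaper; the gap closes fast when provisioned capacity sits idle.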

Monitoring and Logging Cost Optimization

CloudWatch Logs Optimization

Configuring log retention:

#!/bin/bash
# optimize-cloudwatch-logs.sh

REGION="us-east-1"

echo "Optimizing CloudWatch Logs cost..."

# Set log retention periods
LOG_GROUPS=$(aws logs describe-log-groups \
  --query 'logGroups[].logGroupName' \
  --output text \
  --region $REGION)

for LOG_GROUP in $LOG_GROUPS; do
  echo "Configuring log group: $LOG_GROUP"

  # Retention period depends on log type
  if [[ $LOG_GROUP == *"/aws/eks/"* ]]; then
    RETENTION=7   # keep EKS logs for 7 days
  elif [[ $LOG_GROUP == *"/aws/rds/"* ]]; then
    RETENTION=30  # keep RDS logs for 30 days
  else
    RETENTION=14  # keep other logs for 14 days
  fi
  
  aws logs put-retention-policy \
    --log-group-name $LOG_GROUP \
    --retention-in-days $RETENTION \
    --region $REGION
  
  echo "  ✓ Retention: $RETENTION days"
done

echo ""
echo "✓ Log retention policies configured"

Exporting to S3 (long-term storage):

#!/bin/bash
# export-logs-to-s3.sh

LOG_GROUP="/aws/eks/production-eks-cluster/cluster"
BUCKET="production-logs-123456789012"
PREFIX="eks-logs"
FROM=$(date -u -d '7 days ago' +%s)000
TO=$(date -u +%s)000

echo "Exporting logs to S3..."

TASK_ID=$(aws logs create-export-task \
  --log-group-name $LOG_GROUP \
  --from $FROM \
  --to $TO \
  --destination $BUCKET \
  --destination-prefix $PREFIX \
  --query 'taskId' \
  --output text)

echo "Export task ID: $TASK_ID"
echo ""
echo "Storage cost comparison:"
echo "  CloudWatch Logs storage: ~\$0.03/GB/month (ingestion billed separately at ~\$0.50/GB)"
echo "  S3 Standard: \$0.023/GB/month, and far less in Glacier tiers"
echo "  Savings grow once exported logs transition to Glacier (\$0.004/GB/month)"

Cost Governance and FinOps

AWS Budgets Configuration

#!/bin/bash
# create-cost-budgets.sh

ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
EMAIL="finance-team@company.com"

echo "Creating cost budgets..."

# 1. Monthly total cost budget
cat > monthly-budget.json << EOF
{
  "BudgetName": "monthly-total-cost",
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "BudgetLimit": {
    "Amount": "10000",
    "Unit": "USD"
  },
  "CostFilters": {},
  "CostTypes": {
    "IncludeTax": true,
    "IncludeSubscription": true,
    "UseBlended": false,
    "IncludeRefund": false,
    "IncludeCredit": false,
    "IncludeUpfront": true,
    "IncludeRecurring": true,
    "IncludeOtherSubscription": true,
    "IncludeSupport": true,
    "IncludeDiscount": true,
    "UseAmortized": false
  },
  "TimePeriod": {
    "Start": "2024-01-01T00:00:00Z",
    "End": "2087-06-15T00:00:00Z"
  }
}
EOF

cat > budget-notifications.json << EOF
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      {
        "SubscriptionType": "EMAIL",
        "Address": "$EMAIL"
      }
    ]
  },
  {
    "Notification": {
      "NotificationType": "FORECASTED",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 100,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      {
        "SubscriptionType": "EMAIL",
        "Address": "$EMAIL"
      }
    ]
  }
]
EOF

aws budgets create-budget \
  --account-id $ACCOUNT_ID \
  --budget file://monthly-budget.json \
  --notifications-with-subscribers file://budget-notifications.json

echo "✓ Budget created"

rm -f monthly-budget.json budget-notifications.json

Cost Anomaly Detection

#!/bin/bash
# setup-cost-anomaly-detection.sh

echo "Configuring cost anomaly detection..."

# Create a cost anomaly monitor
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "Production Cost Monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Create a subscription
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "Daily Cost Anomaly Alert",
    "Threshold": 100,
    "Frequency": "DAILY",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/xxx"],
    "Subscribers": [
      {
        "Type": "EMAIL",
        "Address": "finance-team@company.com"
      }
    ]
  }'

echo "✓ Anomaly detection configured"

Automated Cost Reporting

#!/bin/bash
# generate-cost-report.sh

REGION="us-east-1"
START_DATE=$(date -u -d '1 month ago' +%Y-%m-01)
END_DATE=$(date -u +%Y-%m-01)

echo "================================================"
echo "Generating cost report"
echo "Period: $START_DATE to $END_DATE"
echo "================================================"

# 1. Group by service
echo ""
echo "1. Cost by service:"
aws ce get-cost-and-usage \
  --time-period Start=$START_DATE,End=$END_DATE \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --region $REGION \
  --output table

# 2. Group by tag (team)
echo ""
echo "2. Cost by team:"
aws ce get-cost-and-usage \
  --time-period Start=$START_DATE,End=$END_DATE \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=Team \
  --region $REGION \
  --output table

# 3. Trend analysis
echo ""
echo "3. Cost trend (daily):"
aws ce get-cost-and-usage \
  --time-period Start=$START_DATE,End=$END_DATE \
  --granularity DAILY \
  --metrics UnblendedCost \
  --region $REGION

echo ""
echo "================================================"
echo "Report complete"
echo "================================================"

Best Practices Summary

1. Cost visibility

✓ Implement a complete tagging strategy
✓ Activate Cost Allocation Tags
✓ Review Cost Explorer regularly
✓ Configure cost anomaly detection
✓ Set budgets and alerts

2. Compute optimization

✓ Mix On-Demand, RI, and Spot
✓ Right-size instances
✓ Use Savings Plans
✓ Delete unused resources
✓ Use Graviton (ARM) instances

3. Storage optimization

✓ Use S3 lifecycle policies
✓ Migrate gp2 to gp3
✓ Delete unused snapshots and volumes
✓ Enable S3 Intelligent-Tiering
✓ Compress and deduplicate data

4. Network optimization

✓ Use VPC Endpoints
✓ Reduce cross-AZ traffic
✓ Use the CloudFront CDN
✓ Optimize NAT Gateway usage
✓ Enable topology-aware routing

5. FinOps culture

✓ Assign cost ownership to teams
✓ Hold regular cost review meetings
✓ Track cost optimization KPIs
✓ Build cost awareness among engineers
✓ Automate cost reporting

6. Continuous optimization

✓ Monthly cost reviews
✓ Regular resource cleanup
✓ Monitor RI/SP utilization
✓ Update optimization strategies
✓ Track optimization results

Summary

This high-availability architecture tutorial covered:

  1. Project planning: requirements analysis, technology selection
  2. Network architecture: Multi-AZ VPC, subnet planning, routing strategy
  3. Security configuration: security groups, SG-to-SG references, least privilege
  4. EKS cluster: control plane, node groups, core add-ons
  5. Application deployment: microservices, load balancing, release strategies
  6. Data layer: RDS, Redis, DynamoDB, S3
  7. Monitoring and logging: Prometheus, Grafana, ELK, Jaeger
  8. Auto scaling: HPA, VPA, Cluster Autoscaler, backup and recovery
  9. Cost optimization: tag strategy, RI/Spot, resource optimization, FinOps

Working through this tutorial gives you the complete skill set for building enterprise-grade highly available architectures on AWS.