Loki and Distributed Tracing

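This section introduces Loki, a lightweight logging solution, and Jaeger, a distributed tracing system.

Loki - Lightweight Logging

Loki is a log aggregation system developed by Grafana Labs. It is lighter than the EFK stack (Elasticsearch, Fluentd, Kibana) because it indexes only a small set of labels per log stream rather than the full text of every line.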

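Loki vs Elasticsearch

Feature          Loki                      Elasticsearch
Indexing         Labels only               Full-text index
Storage cost     Lower                     Higher
Query speed      Fast for label queries    Fast for full-text search
Resource usage   Lower                     Higher
Best suited for  Structured logs           Full-text search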
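To make the "labels only" row concrete, here is a toy model in JavaScript. It is not how Loki is implemented; `LabelIndex` and its methods are hypothetical names invented for this illustration. The idea it shows: the index holds one entry per distinct label set (a stream), while the line text itself is stored unindexed and is scanned only after the label selector has narrowed down the streams.

```javascript
// Toy model of label-only indexing (illustrative; not Loki's storage engine).
class LabelIndex {
  constructor() {
    // Map from a canonical label-set key to that stream's raw lines.
    this.streams = new Map();
  }

  // Canonical key: labels sorted by name, so key order does not matter.
  static key(labels) {
    return Object.keys(labels).sort().map((k) => `${k}="${labels[k]}"`).join(',');
  }

  append(labels, line) {
    const k = LabelIndex.key(labels);
    if (!this.streams.has(k)) {
      this.streams.set(k, { labels, lines: [] });
    }
    this.streams.get(k).lines.push(line);
  }

  // Step 1: select streams by label. Step 2: scan only those streams' lines.
  query(selector, substring) {
    const out = [];
    for (const { labels, lines } of this.streams.values()) {
      const selected = Object.entries(selector).every(([k, v]) => labels[k] === v);
      if (!selected) continue;
      for (const line of lines) {
        if (line.includes(substring)) out.push(line);
      }
    }
    return out;
  }
}

const index = new LabelIndex();
index.append({ app: 'myapp', namespace: 'production' }, 'error: db timeout');
index.append({ app: 'myapp', namespace: 'production' }, 'request ok');
index.append({ app: 'other', namespace: 'production' }, 'error: disk full');

// Three lines were written, but only two streams exist in the index.
const hits = index.query({ app: 'myapp' }, 'error');
console.log(index.streams.size, hits);
```

The index grows with the number of distinct label sets, not with log volume, which is why this approach is cheap to store but depends on keeping label cardinality low.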

Architecture

┌─────────────┐
│ Application │
│   (logs)    │
└──────┬──────┘
       │
┌──────▼──────┐
│  Promtail   │ ◄── Collects logs (lightweight agent)
│ DaemonSet   │
└──────┬──────┘
       │
┌──────▼──────┐
│    Loki     │ ◄── Stores logs (indexes labels only)
└──────┬──────┘
       │
┌──────▼──────┐
│  Grafana    │ ◄── Query and visualization
└─────────────┘
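The arrow from Promtail to Loki is an HTTP POST to `/loki/api/v1/push`. As a rough sketch of what travels over that hop, the snippet below builds the JSON body the push endpoint accepts: a list of streams, each with a label set and `[timestamp, line]` pairs, where the timestamp is a Unix epoch in nanoseconds encoded as a string. `buildPushPayload` is a hypothetical helper written for this example, not part of Promtail or of any Loki client.

```javascript
// Build the JSON body for Loki's push endpoint (/loki/api/v1/push).
// buildPushPayload is a hypothetical helper, shown only to illustrate the shape.
function buildPushPayload(labels, entries) {
  return {
    streams: [
      {
        // The label set identifies the stream; this is the only indexed part.
        stream: labels,
        // Each value is [nanosecond timestamp as a string, raw log line].
        values: entries.map(({ timestampMs, line }) => [
          (BigInt(timestampMs) * 1000000n).toString(),
          line,
        ]),
      },
    ],
  };
}

const payload = buildPushPayload(
  { namespace: 'production', app: 'myapp' },
  [{ timestampMs: 1704715200000, line: 'Database connection failed' }],
);

console.log(JSON.stringify(payload, null, 2));
```

BigInt is used for the conversion because a nanosecond timestamp exceeds the range in which JavaScript numbers represent integers exactly.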

Installing the Loki Stack

# Add the Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install the Loki Stack (Loki + Promtail, plus the optional Grafana and Prometheus sub-charts)
helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace \
  --set grafana.enabled=true \
  --set prometheus.enabled=true \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

Promtail Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki:3100/loki/api/v1/push

    scrape_configs:
    # Kubernetes Pod logs
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      
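      pipeline_stages:
      # Parses Docker json-file logs; on containerd or CRI-O nodes use `- cri: {}` instead
      - docker: {}
      
      relabel_configs:
      # Keep only Pods annotated with prometheus.io/scrape: "true"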
      - source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        action: keep
        regex: true
      
      # Add the namespace label
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: namespace
      
      # Add the pod label
      - source_labels:
        - __meta_kubernetes_pod_name
        target_label: pod
      
      # Add the container label
      - source_labels:
        - __meta_kubernetes_pod_container_name
        target_label: container
      
      # Add the app label (from the Pod's `app` label)
      - source_labels:
        - __meta_kubernetes_pod_label_app
        target_label: app
      
      # Log file path: "<pod UID>/<container name>" is substituted for $1
      - replacement: /var/log/pods/*$1/*.log
        separator: /
        source_labels:
        - __meta_kubernetes_pod_uid
        - __meta_kubernetes_pod_container_name
        target_label: __path__
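The last relabel rule is the least obvious one, so here is a small JavaScript sketch of how a `replace` rule is evaluated: the source label values are joined with the separator, matched against the regex (which defaults to `(.*)` when omitted), and the capture groups are expanded into the replacement. `applyReplace` is a hypothetical helper for illustration only, and the UID and container name are made-up values.

```javascript
// Minimal sketch of the "replace" relabel action.
// applyReplace is hypothetical; it mimics the rule semantics, not Promtail's code.
function applyReplace(labels, rule) {
  const separator = rule.separator ?? ';';
  // Relabel regexes are anchored at both ends; the default is (.*).
  const regex = new RegExp(`^(?:${rule.regex ?? '(.*)'})$`);
  const joined = rule.source_labels.map((name) => labels[name] ?? '').join(separator);
  const match = regex.exec(joined);
  if (!match) return labels; // no match: labels are left unchanged
  // Expand $1, $2, ... in the replacement with the captured groups.
  const value = rule.replacement.replace(/\$(\d+)/g, (_, n) => match[Number(n)] ?? '');
  return { ...labels, [rule.target_label]: value };
}

// The rule from the config above.
const rule = {
  source_labels: ['__meta_kubernetes_pod_uid', '__meta_kubernetes_pod_container_name'],
  separator: '/',
  replacement: '/var/log/pods/*$1/*.log',
  target_label: '__path__',
};

// Example discovery labels (values are invented for this sketch).
const discovered = {
  __meta_kubernetes_pod_uid: 'a1b2c3',
  __meta_kubernetes_pod_container_name: 'api',
};

const relabeled = applyReplace(discovered, rule);
console.log(relabeled.__path__); // /var/log/pods/*a1b2c3/api/*.log
```

The leading `*` in the glob is needed because the kubelet names each Pod log directory `<namespace>_<pod name>_<pod UID>`, so the UID is only the suffix of the directory name.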

Promtail DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
      - name: promtail
        image: grafana/promtail:2.9.0
        args:
        - -config.file=/etc/promtail/promtail.yaml
        
        volumeMounts:
        - name: config
          mountPath: /etc/promtail
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
      
      volumes:
      - name: config
        configMap:
          name: promtail-config
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Querying Loki Logs in Grafana

# Add Loki as a data source in Grafana
# Configuration → Data Sources → Add data source → Loki
# URL: http://loki:3100

# LogQL query examples:

# 1. Logs from a specific namespace
{namespace="production"}

# 2. Logs from a specific app
{app="myapp"}

# 3. Lines containing a given string
{app="myapp"} |= "error"

# 4. Lines not containing a given string
{app="myapp"} != "debug"

# 5. Regular expression match
{app="myapp"} |~ "error|failed"

# 6. Parse JSON and filter on an extracted field
{app="myapp"} | json | level="error"

# 7. Per-second rate of lines containing "error", over a 5-minute window
rate({app="myapp"} |= "error" [5m])

# 8. Multiple label matchers
{namespace="production", app="myapp", container="api"}
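The same LogQL expressions can be sent to Loki directly, which is handy for scripts and alerts outside Grafana. The sketch below builds a request URL for Loki's `query_range` HTTP endpoint. `buildQueryRangeUrl` is a hypothetical helper, and the base URL assumes the in-cluster `loki` service name used earlier in this section.

```javascript
// Build a URL for Loki's range-query endpoint (/loki/api/v1/query_range).
// buildQueryRangeUrl is a hypothetical helper; http://loki:3100 is assumed.
function buildQueryRangeUrl(baseUrl, logql, startMs, endMs, limit = 100) {
  const url = new URL('/loki/api/v1/query_range', baseUrl);
  url.searchParams.set('query', logql);
  // start and end are passed as Unix epoch nanoseconds.
  url.searchParams.set('start', (BigInt(startMs) * 1000000n).toString());
  url.searchParams.set('end', (BigInt(endMs) * 1000000n).toString());
  url.searchParams.set('limit', String(limit));
  return url.toString();
}

const endMs = 1704715200000;            // 2024-01-08T12:00:00Z
const startMs = endMs - 5 * 60 * 1000;  // five minutes earlier

const queryUrl = buildQueryRangeUrl(
  'http://loki:3100',
  '{app="myapp"} |= "error"',
  startMs,
  endMs,
);

console.log(queryUrl);
```

Using `URLSearchParams` matters here because LogQL contains braces, quotes, and pipes that must be percent-encoded.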

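Jaeger - Distributed Tracing

Jaeger is a CNCF distributed tracing system used to follow a request as it flows through the services of a microservice architecture.

Distributed Tracing Concepts

User request → API Gateway → Service A → Service B → Database
               [Span 1]     [Span 2]   [Span 3]   [Span 4]
               └────────────── Trace ID: abc123 ──────────┘
  • Trace: the complete path of one request through the system
  • Span: a single operation (for example an HTTP request or a database query)
  • Trace ID: uniquely identifies a trace
  • Span ID: uniquely identifies a span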
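These concepts map onto a simple data structure. The sketch below models the four spans from the diagram as a flat list and rebuilds the call tree from it. The `parentId` field, the timings, and the helper names `buildTree` and `render` are assumptions made for this illustration: real spans do record which span caused them, but the values here are invented.

```javascript
// Spans of one trace as a flat list; all share the same trace ID.
// parentId links and timings are invented for this sketch.
const spans = [
  { traceId: 'abc123', spanId: 's1', parentId: null, name: 'API Gateway', startMs: 0, endMs: 120 },
  { traceId: 'abc123', spanId: 's2', parentId: 's1', name: 'Service A', startMs: 10, endMs: 110 },
  { traceId: 'abc123', spanId: 's3', parentId: 's2', name: 'Service B', startMs: 20, endMs: 90 },
  { traceId: 'abc123', spanId: 's4', parentId: 's3', name: 'Database', startMs: 30, endMs: 70 },
];

// Rebuild the tree by attaching every span to its parent.
function buildTree(flatSpans) {
  const byId = new Map(flatSpans.map((s) => [s.spanId, { ...s, children: [] }]));
  let root = null;
  for (const node of byId.values()) {
    if (node.parentId === null) {
      root = node;
    } else {
      byId.get(node.parentId).children.push(node);
    }
  }
  return root;
}

// Render an indented timeline: one line per span with its duration.
function render(node, depth = 0, lines = []) {
  lines.push(`${'  '.repeat(depth)}${node.name} (${node.endMs - node.startMs} ms)`);
  for (const child of node.children) render(child, depth + 1, lines);
  return lines;
}

const timeline = render(buildTree(spans));
console.log(timeline.join('\n'));
```

This is essentially what a tracing UI does when it draws the span timeline for a trace: group by trace ID, nest by parent, and show durations.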

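Installing the Jaeger Operator

# Create the namespace
kubectl create namespace tracing

# Install the Jaeger Operator
# Note: recent operator releases require cert-manager to be installed in the cluster first.
# Note: the release manifest targets the `observability` namespace by default; to install
# into `tracing`, the namespace references inside the manifest need to be edited to match.
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n tracing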

Deploying a Jaeger Instance

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: tracing
spec:
  strategy: production
  
  # Storage configuration (Elasticsearch)
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch.logging:9200
        index-prefix: jaeger
  
  # Collector configuration
  collector:
    maxReplicas: 5
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  
  # Query UI configuration
  query:
    replicas: 2
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
  
  # Agent configuration
  agent:
    strategy: DaemonSet
    resources:
      limits:
        cpu: 200m
        memory: 128Mi

Simplified Deployment (All-in-One)

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger-simple
  namespace: tracing
spec:
  strategy: allInOne
  allInOne:
    image: jaegertracing/all-in-one:latest
    options:
      memory:
        max-traces: 100000
  storage:
    type: memory
  ingress:
    enabled: true

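Integrating Applications with OpenTelemetry

OpenTelemetry is a CNCF observability framework that provides a unified standard for metrics, logs, and traces.

Node.js Integration

// Install dependencies
// npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
//   @opentelemetry/exporter-jaeger @opentelemetry/resources @opentelemetry/semantic-conventions
// Note: @opentelemetry/exporter-jaeger is deprecated upstream; Jaeger accepts OTLP directly,
// so newer setups typically use an OTLP trace exporter instead.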

// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const jaegerExporter = new JaegerExporter({
  endpoint: 'http://jaeger-collector:14268/api/traces',
});

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'myapp',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: jaegerExporter,
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

// app.js
require('./tracing');
const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  // A span for this request is created automatically by the auto-instrumentation
  const users = await fetchUsers();
  res.json(users);
});

app.listen(3000);

Creating Spans Manually

const opentelemetry = require('@opentelemetry/api');

// `db` is assumed to be an already-connected MongoDB database handle.
async function fetchUsers() {
  const tracer = opentelemetry.trace.getTracer('myapp');
  
  // Start a span
  const span = tracer.startSpan('fetchUsers');
  
  try {
    // Add attributes
    span.setAttribute('db.system', 'mongodb');
    span.setAttribute('db.name', 'users');
    
    // Run the business logic
    const users = await db.collection('users').find().toArray();
    
    // Add an event
    span.addEvent('users fetched', {
      count: users.length
    });
    
    return users;
  } catch (error) {
    // Record the error and mark the span as failed
    span.recordException(error);
    span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR });
    throw error;
  } finally {
    // Always end the span, otherwise it is never exported
    span.end();
  }
}

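Go Integration

// Note: the go.opentelemetry.io/otel/exporters/jaeger module used below is deprecated upstream;
// newer OpenTelemetry Go releases export to Jaeger over OTLP instead.
package main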

import (
    "context"
    "log"
    
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() func() {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
    ))
    if err != nil {
        log.Fatal(err)
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("myapp"),
            semconv.ServiceVersionKey.String("1.0.0"),
        )),
    )
    
    otel.SetTracerProvider(tp)
    
    return func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Fatal(err)
        }
    }
}

func main() {
    cleanup := initTracer()
    defer cleanup()
    
    // Application code
}

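Accessing the Jaeger UI

# Port-forward the query service
kubectl port-forward -n tracing svc/jaeger-query 16686:16686

# Then open http://localhost:16686 in a browser

Jaeger UI Features

  1. Search traces: filter by service, operation, and tags
  2. Trace detail: inspect the span timeline and durations
  3. Dependency graph: view the dependencies between services
  4. Compare traces: compare the performance of different requests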

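Best Practices

1. Log Format

Emit logs as structured JSON with consistent field names, and include trace_id and span_id so that a log line can be matched to its trace: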

{
  "timestamp": "2024-01-08T12:00:00Z",
  "level": "error",
  "message": "Database connection failed",
  "service": "user-service",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user123",
  "error": {
    "type": "ConnectionError",
    "message": "Connection timeout after 5s"
  }
}
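One way to keep every service on this format is to funnel all log output through a single formatting function. The following is a minimal sketch, assuming the field names from the sample record above. `formatLog` is a hypothetical helper, not part of any logging library, and the `now` parameter exists only to make the output reproducible.

```javascript
// Produce one JSON log line in the format shown above.
// formatLog is a hypothetical helper; field names follow the sample record.
function formatLog(fields, now = new Date()) {
  const { level, message, service, traceId, spanId, error, ...extra } = fields;
  const record = {
    timestamp: now.toISOString(),
    level,
    message,
    service,
    trace_id: traceId,
    span_id: spanId,
    ...extra, // any additional context, e.g. user_id
  };
  if (error) {
    // Keep errors structured instead of flattening them into the message.
    record.error = { type: error.name, message: error.message };
  }
  return JSON.stringify(record);
}

const failure = new Error('Connection timeout after 5s');
failure.name = 'ConnectionError';

const logLine = formatLog(
  {
    level: 'error',
    message: 'Database connection failed',
    service: 'user-service',
    traceId: 'abc123',
    spanId: 'def456',
    user_id: 'user123',
    error: failure,
  },
  new Date('2024-01-08T12:00:00Z'),
);

console.log(logLine);
```

Because each record is a single line of JSON, the `| json` stage in LogQL can extract `level`, `trace_id`, and the other fields at query time without any of them being configured as labels.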

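2. Sampling Strategy

// Sample in production to reduce overhead
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  // Follow the parent span's sampling decision; for root spans, sample 10% of traces
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),

  // Alternative: sample 10% of traces regardless of the parent's decision
  // sampler: new TraceIdRatioBasedSampler(0.1),
});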
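The key property of ratio-based sampling is that the decision is derived from the trace ID rather than from a random draw per span, so every span of a trace gets the same verdict. The sketch below illustrates that idea in a simplified form. It is not the SDK's actual algorithm: `shouldSample` is a hypothetical function, and using the first 8 hex characters as the bucket is an assumption made to keep the example short.

```javascript
// Simplified illustration of trace-ID-based ratio sampling.
// shouldSample is hypothetical and is NOT the OpenTelemetry SDK implementation.
function shouldSample(traceId, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the first 8 hex characters as an integer in [0, 2^32).
  const bucket = parseInt(traceId.slice(0, 8), 16);
  // Keep the trace if its bucket falls in the lowest `ratio` share of the range.
  return bucket < ratio * 0x100000000;
}

const lowId = '00000000b7ad6b7169203331f0e4736a';  // bucket 0
const highId = 'ffffffffb7ad6b7169203331f0e4736a'; // bucket 2^32 - 1

console.log(shouldSample(lowId, 0.1));  // true
console.log(shouldSample(highId, 0.1)); // false
// Repeating the call always gives the same answer for the same trace ID.
console.log(shouldSample(lowId, 0.1) === shouldSample(lowId, 0.1)); // true
```

Since trace IDs are generated randomly, roughly the configured share of traces ends up below the threshold, and any service that applies the same rule to the same ID reaches the same decision without coordination.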

3. Correlating Logs and Traces

const logger = require('winston');
const opentelemetry = require('@opentelemetry/api');

function log(level, message) {
  const span = opentelemetry.trace.getActiveSpan();
  const context = span ? {
    trace_id: span.spanContext().traceId,
    span_id: span.spanContext().spanId,
  } : {};
  
  logger.log(level, message, context);
}
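Once log lines carry a `trace_id` field, the reverse lookup (from a trace in Jaeger to its logs in Loki) is a single LogQL query. A small sketch of building that query follows. `logsForTraceQuery` is a hypothetical helper; it assumes JSON logs with a `trace_id` field and an `app` label as configured earlier, and the trace ID in the example is made up.

```javascript
// Build a LogQL query that returns all log lines belonging to one trace.
// logsForTraceQuery is a hypothetical helper for illustration.
function logsForTraceQuery(app, traceId) {
  // OpenTelemetry trace IDs are 16 bytes, written as 32 lowercase hex characters.
  if (!/^[0-9a-f]{32}$/.test(traceId)) {
    throw new Error('trace ID must be 32 lowercase hex characters');
  }
  // Escape backslashes and quotes so the label value cannot break the selector.
  const safeApp = app.replace(/["\\]/g, '\\$&');
  return `{app="${safeApp}"} | json | trace_id="${traceId}"`;
}

const traceQuery = logsForTraceQuery('myapp', '4bf92f3577b34da6a3ce929d0e0e4736');
console.log(traceQuery);
// {app="myapp"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
```

The stream selector narrows the search by label first, and the `trace_id` filter is applied to the parsed JSON afterwards, so the trace ID never needs to become a high-cardinality label.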

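Summary

This section covered a lightweight logging and tracing stack:

Loki: lightweight log aggregation that indexes only labels
Promtail: a DaemonSet that collects logs from each node
LogQL: the query language for Loki logs in Grafana
Jaeger: a distributed tracing system
OpenTelemetry: a unified observability framework
Application integration: Node.js and Go examples

Next section: monitoring best practices.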